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M-estimates  of  regression  parameters  are  found  by  minimizing  the  sum 
of  a  function  of  the  difference  between  observed  and  predicted  values  of 
a  dependent  variable.  The  choice  of  a  particular  function  before  the  data 
have  been  examined  is  shown  to  have  serious  consequences  for  the  asymptotic 
variance  of  the  parameter  estimates.  Previous  adaptive  M-estimates  used 
one  of  a  small  number  of  functions  selected  after  preliminary  examination 
of  the  data. 

Continuously  adaptive  M-estimation  (CAM)  is  introduced  to  choose  a 
function  according  to  maximum  likelihood  criteria  from  a  continuous  class 
of  functions,  thereby  simultaneously  estimating  the  regression  parameters 
and  the  underlying  error  density.  Algorithms  for  calculating  the  estimates 
are  derived  and  numerical  examples  demonstrate  the  method's  performance  in 
a  variety  of  regression  problems,  including  symmetric  and  asymmetric  errors, 


CHAPTER  1 
CONTINUOUSLY  ADAPTIVE  M-ESTIMATION  IN  THE  LINEAR  MODEL 


Consider  the  multiple  linear  regression  model 


(1.1)  y^ 


?Q  +  B^x.^  +  e2^•2  -"  ■■■  ^  h^^k  ^  ^• 


i  =  1 , . . . ,  n 


where  the  A-  are  independent  and  identically  distributed  with  mean  zero. 
The  distribution  of  the  A-  will  be  referred  to  as  the  error  distribution. 


To  estimate  B  =  (Bq,  B-j  , 


,Br)'  as  a  linear  transformation  of 


Y  =  {yi,---,y   )'  one  would  use  the  method  of  least  squares.  The  Gauss- 
Markov  Theorem  (Rao,  1973)  insures  that  the  least  squares  estimator  of 
B  is  the  best  linear  unbiased  estimator  of  B  for  the  model  (1.1)  when 
A.  are  iid.  A  least  squares  estimator  of  B  is  a  vector  B  which  minimizes 
the  sum  of  squares  of  the  residuals,  that  is, 

n 


(1.2)    B  e  {  Y  e  R 


k+1 


I   (y-  -  x'y)^     "is  minimum  } 
i-1    ^   ~^~ 


where  x.  =  (1,  x.,  ,x.,, . . . ,x.,  )'.  Let  X  be  an  n  x  (k  +  1)  matrix  whose 


..^,-•2. 


^ik 


ith  row  is  x^  (i  =  l,...,n).  If  X  is  of  full  column  rank,  then  B  is 
unique. 

When  A.  are  iid  normal,  the  least  squares  estimator  of  B  1s  also 

the  maximum  likelihood  estimator.  The  maximum  likelihood  estimator  of 

n 

B  is  a  vector  B  which  maximizes  the  likelihood  function  L  =  n  f(y- 

i  =  l    ^ 
x'b).  where  f  is  the  density  of  A..  Now  maximizing  L  is  equivalent  to 


mi 


nimizing  -In  L,  since  the  natural  logarithm  is  a  monotonic  function. 


(1.3)     -In  L  =  -  I   In  f(y  -  x^b) 

i=l       ^   ~'~ 


1 


or,  since  f(A.)   =   (l//2-fra2)   exp(-A?/2a")   for  A.   normal  with  mean  0,  mini- 
mizing  (1.3)   is  equivalent  to  minimizing 

(1.4)  o^      I      (y-   -   x'3)' 


If  o^   is  known,  and  A.  are  iid  normal,  then  maximizing  L  is  equiva- 
lent to  minimizing  the  sums  of  squares  of  the  residuals. 

But  if  A  are  iid  and  not  normally  distributed,  then  least  squares 
will  no  longer  be  the  maximum  likelihood  solution.  In  general,  a  maximum 
likelihood  estimator  of  3  will  be  a  vector  3*  which  minimizes 

(1.5)  -In  L  =  I     p(y.  -  x:3) 

with  respect  to  3,  where  p  =  -In  f.  If  p  is  strictly  convex,  and  X  is 
full  rank,  then  3*  is  unique. 

For  continuous  and  differentiable  p,  this  minimum  can  be  found  by 
differentiating  -In  L  with  respect  to  3  and  setting  the  result  equal  to 
zero.  We  will  denote  the  derivative,  p',  of  p,  by  i!;,     A  maximum  likeli- 
hood estimate  can  be  found  as  a  solution  of 

n 

(1.6)  I     ^{y.   -   x:3)x..  =0       j  =  0,l,...,k 
i=l    ^   ~^~  ^^ 

where  x.q  =  1,  i  -   l,...,n,  and  where  ip  =  pT 

Since  p  =  -In  f,  and  ^   =  p',  we  can  express  ^   in  terms  of  f  as 
i)  =   -f"/f.  Thus,  when  f  is  known,  the  maximum  likelihood  estimator  of  3 
can  be  found  by  solving  the  k  +  1  equations  in  (1.5)  using  ip  -   -f"/f. 

Note  that  in  the  case  of  the  normal  density,  i|j(A)  =  h/a^,   and  the 
resulting  equations  are  known  as  the  normal  equations. 

If  the  error  distribution  is  known  to  be  double  exponential,  then 
p(A)  =  |a|,  i]j(A)  =  sign(A)  and  the  resulting  method  is  known  as  least 
absolute  value  (LAV)  regression  (Gentle,  1978). 


Equation  (1.6)  applies  only  when  \p   is  homogeneous  of  degree  one, 
that  is,  when  i|j(c-A)  =  c-<^(A).  In  this  case,  jJj  will  be  unaffected  by  the 
scale  of  the  error  distribution.  Both  OLS  and  LAV  estimation  have  homo- 
geneous ii   functions.  Many  ip  functions  proposed  are  not  scale  invariant. 
For  these  techniques,  Equation  (1.6)  must  be  modified  to  incorporate  an 
estimate  of  scale,  say  s,  in  which  case,  one  must  solve 

n    y.  -  x^B 

(1.7)  I       ^(^   1   ~i~)  X   =0        j  =  0,1,. ..,k 
1=1      ^      ^J 

Estimates  of  scale  are  not  independent  of  estimates  of  g  in  this 

case.     In  addition,  since  a  goal   of  this  form  of  estimation  is  to  allow 

departures  from  normality  for  the  error  process,  a  scale  estimate  based 

on  a  sum     of  squares  may  not  be  appropriate.     Estimators  of  scale  used 

in  Andrews  et  al.    (1972),  where  r.   =  y.   -   xrg*     include 

(1.8)  s  =   {   interquartile   range  of  r.    }/1.35 

(1.9)  s  =  median    {  |r.   -  median   {r.}|  }/.6754 

(1.10)  s'=  J_      y      ^2 

n-1      ^^       1 

Table  1.1  gives  t|;  functions  and  references  for  each.  These  are 
presented  without  reference  to  the  scale  estimate  which  may  be  necessary 
in  their  use. 

Use  of  scale  estimates  introduces  a  problem  in  numerical  solution 
of  Equation  (1.7).  Techniques  for  the  solution  of  these  equations  often 
involve  iteration  through  a  succession  of  parameter  vector  estimates.  At 
each  iteration,  the  current  estimate  can  be  used  to  get  a  current  estimate 
of  r^. ,  which  can  then  be  used  in  one  of  the  Equations  (1.8),  (1.9),  or 
(1.10)  to  produce  a  new  estimate  of  s.  Iterating  on  both  g*  and  s  can 


lead  to  serious  convergence  problems.  See  Birch  (1980b)  for  convergence 
results. 

If  f  is  known,  then  estimation  of  3  and  s  can  be  done  jointly  by 
solving 


n 


^i->^i  ^. 


(l.n)  I     X(-^-^)  =  na 

i=l     ^ 

for  s  simultaneously  with  (1.7),  where  x(A)  is 

(1.12)  X(A)  =  Aii;(A)  -  p(A) 

Huber  (1981)  shows  that  in  order  to  obtain  consistency  of  the  scale 
estimate  when  f  is  the  normal  distribution,  and  to  reproduce  the  standard 
OLS  estimate  of  scale  when  p(A)  =  A^/2,  the   constant  a   in  Equation 
(1.11)  must  be 


(1.13)      a  =  "-  V^^  Ejx) 


Convergence  problems  and  choice  of  scale  estimators  are  serious  problems 
in  using  a  i^  function.  In  Chapter  5,  a  method  is  presented  which  esti- 
mates a  variance  of  the  error  process  jointly  with  regression  parameters, 
but  avoids  both  these  problems. 

When  fitting  a  linear  model  to  a  set  of  data,  the  error  distribution 
is  generally  unknown.  Maximum  likelihood  estimates  cannot  be  obtained 
since  the  method  requires  the  specification  of  f.  In  this  case  one  would 
assume  a  form  for  f,  and  use  the  corresponding  ip.  It  would  be  desirable 
to  use  a  i|j  function  which  will  produce  estimates  3*  wriich  will  not  be 
sensitive  to  small  changes  in  ^.     Such  a  method  would  be  called  qualita- 
tively robust  (Hampel ,  1971).  Practitioners  have  chosen  least  squares 
because  of  its  computational  simplicity,  the  Central  Limit  Theorem,  and 


the  Theory  of  Errors  (Rao,  1973).  In  many  cases,  the  error  distribution 
cannot  be  assumed  to  be  normal,  either  because  of  knowledge  of  the  under- 
lying error  process,  examination  of  the  residuals,  or  the  presence  of 
outliers  (Barnett  and  Lewis,  1978).  In  particular,  if  the  error  distri- 
bution is  known  to  be  asymmetric,  or  bounded,  then  least  squares  cannot  be 
a  reasonable  estimation  method. 

Estimates  of  the  regression  parameter  vector  which  are  obtained  using 
a  ^   function  and  solving  Equations  (1.5)  are  known  as  M-estimates  (for 
maximum  likelihood,  although  it  must  be  noted  that  M-estimates  are  maximum 
likelihood  estimates  only  for  the  density  f,  such  that  i)  ^   -f'/f).  When 
the  error  distribution  is  unknown,  one  wishes  to  select  ^   so  that  the 
estimates  obtained  will  be  close  to  estimates  which  would  have  been  obtained 
had  f  been  known. 

Other  approaches  to  the  problem  of  fitting  parameters  when  the  error 
distribution  is  unknown  have  been  proposed. 

L-estimates  involve  a  linear  combination  of  order  statistics  and  can 
be  used  in  location  problems.  They  have  not  been  generalized  to  regression 
problems.  Similarly  R-estimates  are  linear  combinations  of  functions  of 
the  ranks  of  a  sample,  and  are  not  applicable  in  the  regression  setting. 

Denby  and  Larsen  (1977)  study  seven  alternatives  to  least  squares 
regression  via  Monte  Carlo  simulation.  Two  of  their  alternatives  are 
M-estimators.  Andrews  et  al .  (1972)  study  robust  estimates  of  location. 
Of  68  estimators  studied,  27  are  M-estimates.  Huber  (1981)  concludes 
that  of  all  the  techniques  for  obtaining  estimates  when  little  information 
concerning  the  error  density  is  available,  M-estimates  are  preferred 
because  they  are  more  easily  generalizable,  are  simpler  to  study  analyti- 
cally, and  have  performed  well  in  all  empirical  work  done  to  date. 


Several  reviews  of  robust  estimation  and  M-estimators  for  location 
and  regression  have  appeared.  Hampel  (1973)  reviews  estimators  of  location 
within  the  framework  of  qualitative  robustness.  Huber  (1973)  gives  basic 
results  on  M-estimators  for  regression  involving  asymptotic  design  pro- 
perties and  methods  of  calculation.  Hogg  (1979)  gives  a  short  review  of 
the  use  of  M-estimators'  in  applications.  Narula  and  Wellington  (1979) 
discuss  the  extension  of  M-estimators  from  their  use  in  estimating  location 
parameters  to  their  use  in  the  regression  setting.  The  most  comprehensive 
review  available  is  the  one  by  Ershov  (1979)  in  which  the  history  of  robust 
estimation  is  outlined,  and  M-estimators  are  compared  with  other  robust 
regression  estimators.  Ershov  cites  294  references  in  the  field  of  robust 
estimation.  Bickel  (1976)  presents  a  good  review  of  recent  developments. 
Box  (1978)  discusses  the  process  of  constructing  models  which  incorporate 
various  departures  from  classical  assumptions.  He  includes  M-estimates  as 
a  natural  extension  to  incorporate  non-normality  in  regression.  Hogg  (1978) 
provides  a  good  introduction  to  the  theory  and  application  of  M-estimates. 
Stigler  (1973)  provides  a  review  of  the  early  attempts  at  robust  estimation. 

Asymptotic  distributional  properties  have  been  studied  by  Huber  for 
the  location  problem  (1964)  and  the  regression  problem  (1973).  Yohai  and 
Maronna  (1979)  have  proven  under  stringent  regularity  conditions  that  3* 
will  be  asymptotically  normally  distributed  for  i)   and  f  not  related  by 
the  maximum  likelihood  criteria,  Polyak  and  Tsypkin  (1978)  have  stated 
simple  conditions  for  asymptotic  normality  without  proof.  Carrol  and 
Ruppert  (1979)  prove  almost  sure  convergence  under  mild  regularity  condi- 
tions. Asymptotic  results  are  presented  in  Chapter  2. 

Aside  from  the  distributional  questions  partially  answered  by  these 
results,  the  central  question  in  using  M-estimators  is  the  selection  of 


^  when  f  is  unknown.  The  effects  of  misspecification,  choosing  a  ip   function 
which  does  not  correspond  to  the  true  error  distribution,  are  illustrated 
using  two  examples  in  Chapter  3. 

Existing  methods  for  proceeding  to  use  an  M-estimator  when  the  error 
distribution  is  unknown  are  presented  in  Chapter  4.  These  methods  fall 
into  two  groups:  those  that  attempt  to  find  a  ^   function  which  will  work 
reasonably  well  whatever  f  may  be  (robust  methods),  and  those  that  choose 
^   after  looking  at  the  data  (adaptive  methods).  The  robust  methods  use 
three  basic  approaches:  empirical,  where  i];  and  f  choices  are  run  and 
compared  on  sets  of  data  to  measure  performance;  asymptotic,  where  i|;  is 
chosen  on  the  basis  of  an  asymptotic  optimality  criterion;  and  minimax, 
where  '^  is  chosen  to  protect  against  the  "worst"  error  distribution  in 
a  specified  class.  Adaptive  methods  use  a  statistic  calculated  from  a 
preliminary  fit  in  a  decision  rule  to  select  the  <];  function. 

Once  a  \p   function  is  chosen,  Equations  (1.6)  can  be  solved  to  yield 
estimates.  Whatever  the  method  of  selection  employed  to  choose  ^,   the 
resulting  fit  is  a  maximum  likelihood  fit  for  the  corresponding  f.  We 
can  express  f  in  terms  of  i^  by  solving 

(1.7)  ^i>  =   -fVf 
for  f,  yielding 

(1.8)  f  =  ke~^'^ 

r 


f(A)dA  =  1.   It 

/-co 


where  k  is  a  constant  which  can  be  derived  so  that 
seems  reasonable  to  expect  that  if  f^  is  the  true  error  distribution  and 
one  uses  ^^   to  fit  the  data,  then  f-,  corresponding  to  ^-^   via  (1.8)  should 
closely  resemble  f^.   If  f-,  is  unreasonable,  then  the  method  of  fitting, 
represented  by  ^■,,   is  unreasonable.  For  least  squares,  i)-^{^)   -—   and  f-|(A) 
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is  the  normal  distribution.  Experimental  designs  may  reduce  the  effect  of 
mi sspecifi cation  of  ^. 

Other  ii  functions  proposed  in  robustness  studies  can  be  solved  for 
their  corresponding  error  densities.  A  brief  table  of  the  most  popular 
^   functions  in  the  literature  is  given  in  Table  1.1. 

Once  a  ^   function  has  been  chosen,  usually  from  the  list  in  Table  1.1, 
one  can  solve  Equations  (1.6)  using  an  appropriate  numerical  technique. 
Holland  and  Welsch  (1977)  compare  three  basic  techniques  for  solving  the 
equations.  Newton's  method  is  the  most  straightforward,  but  requires  ii" 
to  exist.  Huber  (1973)  gives  a  simple  algorithm,  but  one  which  cannot  be 
used  with  existing  regression  software.  Dutter  (1977a)  refines  Huber's 
algorithm  for  speed,  but  implementing  this  algorithm  would  require  careful 
and  extensive  programming.  The  third  algorithm  is  iteratively  reweighted 
least  squares  (IRWLS)  (Beaton  and  Tukey,  1974).  They  conclude  that  for 
most  ^   functions  IRWLS  is  the  easiest  to  use,  most  reliable,  and  often 
the  fastest.  Dutter  (1977b)  presents  a  comprehensive  review  of  14  algo- 
rithms for  solving  Equations  (1.5),  applied  to  80  data  sets,  some  con- 
structed, some  "real  life"  data.  He  finds  no  striking  differences 
between  the  algorithms  in  their  accuracy  or  computer  time  usage.  For 
this  reason,  IRWLS  is  preferred  since  its  computations  are  quite  simple, 
and  can  be  performed  using  existing  regression  programs. 

To  perform  IRWLS  an  initial  estimate  Bq  of  the  parameter  vector  6 
is  required.  Initial  estimates  may  be  obtained  using  least  squares  or 
LAV  estimation.  Birch  (1980a)  has  compared  the  dependence  of  IRWLS  on 
the  starting  estimate.  To  iterate  to  a  new  estimate  at  the  ith  step, 
compute 

(1.14)    3|  -  3^_-,  +  (X'WX)"'  X'WA*._-| 


Table  1.1.  ^   functions  and  their  MLE  densities, 


Reference 


i>{A)   =   A/a2 


i{j(A)  =  sign(A) 


^{A)  =   sign(A)|A|P'"^ 


'I^(A)  = 


A      |A|<k 
sign(A) • k  |A|>k 


Rao,  1973 
Gentle,  1978 
Forsythe,  1972 


Huber,  1973 
"Proposal  two' 


Density 


Normal 


Double  Exponential 


Power  Family,  see 
Chapter  3 

Normal     lA|<k 
Exponential  lA  >k 


i|^(A)  -  L-tanh(^) 


4^(A)  =  aA/(h+(A-c)2) 


Bickel ,  1978 


Logistic 


Andrews  et  al.  ,1972  Cauchy 


l^(A) 


A      |A|<a 
a-sign(A)  a<A<b 
-b<A<-a 

a.c^AL   b<A;c 
c-b 

-c<A<-b 
0      |A|>c 


Normal      |A|<a 
Andrews  etal.,1972  Exponential  a<|A|<b 

U-shaped     b<|Al<c 


iJ;(A)   = 


sin(|)        |Al<CTr 
0  iA|>0 


Andrews,  1974 


f(A)  =  ke^"'^°^W|A|<OT 
k,  a  function  of  c 


^(A)  = 


A 
0 


|A|<h 
|Al>h 


Denby  and  Mallows, 
1977 


f(A) 


j-'^(h)-$(-h)-j 
.-aV2 


i(;(a)  =  A-e 


■a  A 


Ramsey,  1977 


Table  1.1.  Extended 
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Reference 


Density 


^i^)  ■- 

=   A- 

e   ^ 

l'(A)    = 

A 

1 

+     A 

Dennis  and 
Welsch,  1976 

Fair,  1974 


^(A)  = 


1  -  b 


A  >a 


ba 
0      lAka 


Ershov,  1979 


^(A)  =  ■ 


tan(^)  |A|<a 

CO. 


A  >a 


Ershov,  1979 


i|j(A)  =  A- 


Moberg  et  al . ,     f (A) 
1980 


2  /2:       -AV4 
flT/lT  ^ 


lA|P   D 


Gross,  1977 


^(A)  - 


a  -  |A|   '  ' 
0       |A|>a 


none 


f  triangular  on 
[-a, a] 


ip(A) 


Holland  and 
Welsch,  1977 


t  distribution  with 
V  degrees  of  freedom 


n 


where  X  is  an  nx(k+l)  matrix  of  fixed  points,  A*._,  is  an  nxl  column  vector 
of  estimates  of  the  residuals, 


(1.15) 


A*  ,  =  Y  -  XS-^  T 


and  W  is  an  nxn  matrix  (w  l),  where 


(1.16) 


^ab 


^((A|_^)a) 
(^i-l)a 


if  a^b 


a-b 


Bt  can  be  used  to  generate  a  new  A*  and  new  w,  and  these  in  turn  used  to 

generate  3*,i.  These  iterations  are  repeated  until  successive  iterations 
^       ~i  +  l 

produce  little  change  in  B*.  Convergence  properties  of  IRWLS  are  examined 
in  Birch  (1980b). 

IRWLS  is  easily  programmed.   It  is  available  in  SAS,  the  Statistical 
Analysis  System  (SAS  Institute,  1979),  in  PROC  NLIN.  A  short  program 
listing  is  provided  in  the  Appendix.  The  ROSEPAK  subroutine  library 
(Klema,  1976)  is  a  commercially  available  IRWLS  subroutine  library.   IRWLS 
is  not  recommended  when  ^   is  not  continuous.  In  particular,  when  LAV 
estimation  is  desired,  special  algorithms  exist.  Barrodale  and  Roberts 
(1974)  solve  the  LAV  estimation  problem  using  a  modified  simplex  algorithm. 
Sposito  et  al.  (1977)  present  a  special  algorithm  for  ii;(A)  =  sign(A)  |  A  |P"1  . 

Direct  optimization  of  (1.5)  using  Price's  controlled  random  search 
procedure  (Price,  1977)  is  inefficient,  but  can  be  applied  to  nonlinear 
and  constrained  models,  as  well  as  the  model  of  (1.1). 

Iterating  until  convergence  in  IRWLS  may  not  be  required  in  particu- 
lar cases.  Bickel  (1975)  argues  that  estimates  derived  from  one  Newton 
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iteration  have  the  same  asymptotic  properties  as  fully  iterated  estimates, 
and  perform  comparably  on  small  samples.  These  "one  step"  estimators  are 
very   quick  to  compute. 

Many  of  the  ^   functions  presented  in  Table  1.1  have  constants  in 
them  which  must  be  specified  by  the  practitioner.  Optimal  values  for 
each  constant  exist  in  terms  of  the  true  error  distribution,  but  before 
analysis,  these  values  are  unknown.  Holland  and  Welsch  (1977)  give 
constants  for  several  of  these  'i)  functions  to  achieve  95%  asymptotic 
efficiency  when  f  is  the  normal  distribution. 

Once  estimates  have  been  obtained,  several  questions  relating  to 
any  regression  problem  can  be  stated.  One  class  of  problems  involves 
optimal  design  of  regression  experiments  when  the  data  come  from  an 
unknown  distribution.  Box  and  Watson  (1962)  propose  a  criteria  for  first 
order  designs  to  minimize  the  effects  of  non-normality  on  F-tests.  These 
designs  are  called  zero  kurtosis  designs.  The  design  kurtosis  C^,  is  given 
in  Khuri  and  Myers  (1981)  as 

n  17^        r  -  (n-l)[n(n+l)h  -  k(k+2)(n-l)] 
^^'^^>  h        k(n  -  k  -  l)(n  -  3) 

where 

(1.18)  h  =  I       h?. 

i  =  l   ^^ 

and  h..  is  the  ith  diagonal  element  of  H  =  D(D'D)"-^D,  where  D  is  such 

that  X  =  [1:D].   Khuri  and  Myers  give  necessary  and  sufficient  conditions 

for  these  designs  to  exist  and  describe  a  method  for  construction  of  such 

designs  for  a  given  number  of  experimental  runs.  Vuchkov  and  Solakov  (1980) 

demonstrate  that  the  kurtosis  is  related  to  the  distribution  of  the  repeated 

points  among  the  support  points  of  the  design. 
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Box  and  Draper  (1975)  indicate  that  the  quantity  h  in  (1.18)  is  a 
measure  of  design  sensitivity  to  outliers,  and  present  14  general  proper- 
ties of  good  response  surface  designs.  Herzberg  and  Andrews  (1976)  intro- 
duce two  additional  criteria.  Andrews  and  Herzberg  (1979)  compare  the 
criteria  and  calculate  values  for  several  designs.  Huber  (1975)  considers 
distributional  robustness  properties  of  designs  and  model  robustness.  He 
recommends  each  h..  be  much  less  than  1,  and  that  lack  of  distributional 
robustness  produces  far  more  disastrous  results  than  failures  of  model 

robustness. 

Variable  selection  when  using  estimators  other  than  OLS  have  been 
studied  only  for  two  alternatives.  Gentle  and  Hanson  (1977)  propose 
techniques  when  using  LAV  estimators.  Narula  and  Wellington  (1977) 
investigate  variable  selection  when  the  criteria  is  minimization  of  maxi- 
mum weighted  absolute  errors.  General  variable  selection  using  M-estimators 
has  not  been  studied. 

Other  questions  related  to  M-estimators  involve  the  study  of  residuals, 
subsequent  tests  of  hypotheses,  and  calculation  of  multiple  comparisons. 
Bickel  (1976)  introduces  the  concept  of  "pseudo  observations"  in  M-esti- 
mation,  as  an  attempt  to  answer  these  questions.  Let 

(1.19)  y^  =  x:3*  +  X*-i  ifj(r.) 

where 

1   " 


(1.20)  X*  =^   I  Vir.) 


and 


'^i  "  ^i  -  ^i!* 


The  pseudo  observations  are  defined  so  that  the  least  squares  estimate 
3  based  on  the  pseudo  observations  is  equal  to  the  robust  estimate  3* 
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obtained  from  the  original  data.  The  pseudo  observations  can  be  used  for 
further  analysis  such  as  successive  testing,  multiple  comparisons,  C  plots, 
ANOVA  tables,  and  residual  plots.  No  rigorous  results  on  the  performance 
of  pseudo  observations  have  appeared  in  the  literature.  Their  existence 
does,  however,  provide  the  data  analyst  v/ith  a  useful  tool  for  performing 
a  "standard"  regression  analysis  when  using  an  M-estimator. 

M-estimators  are  considered  in  relation  to  outlier  processing  techni- 
ques by  Barnett  and  Lewis  (1978).  M-estimates  accommodate  outliers  by 
estimating  from  all  the  data.  M-estimators  which  have  ijj  functions  which 
redescend  to  zero  effectively  reject  outliers  by  giving  them  weight  0  in 
Equations  (1.5).  Andrews  (1974)  demonstrates  the  ability  of  the  SINE  \p 
function  to  down  weight  outliers,  using  the  data  of  Daniel  and  Wood  (1971) 
as  an  example.  M-estimators  which  descend  to  zero  asymptotically  down 
weight  outliers,  and  thereby  accommodate  them  with  small  weight.  The 
Bisquare  estimator  (Gross,  1977)  is  a  good  example  of  an  asymptotically 
redescending  M-estimator. 

M-estimators  which  redescend  to  zero  trim  the  data  at  specific 
values,  controlled  by  tuning  constants.  These  techniques  can  accommodate 
outliers  generated  by  a  process  other  than  the  true  error  process,  if  the 
contaminating  values  are  outside  the  range  of  values  to  be  kept  after 
trimming.  Trimming  techniques  will  fail  if  no  data  fall  inside  the 
acceptable  range.  In  this  case,  the  tuning  constants  may  be  adjusted 
and  the  procedure  repeated.  . 

M-estimators,  whose  ij;  functions  are  always  nonzero,  give  weight  to 
all  points  in  the  sample  and  are  not  used  to  model  truly  contaminated 
data.  These  estimators  model  data  which  inherently  contain  outliers  due 
to  the  heavy  tailed  characteristics  of  the  presumed  error  density. 
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M-estimators  may  perform  well  in  situations  which  they  do  not  model. 
This  is  a  measure  of  their  robustness,  not  their  validity  for  use  in 
handling  outliers. 

No  single  choice  of  i)   is  likely  to  be  found  which  will  perform  well 
in  the  wide  variety  of  situations  encountered  with  experimental  data. 
Current  adaptive  techniques,  which  calculate  a  statistic  of  the  data 
before  choosing  a  i|j,  are  limited  to  a  small  catalogue  of  alternatives. 
In  Chapter  2  the  basic  asymptotic  foundations  of  M-estimators,  where  \b 
is  chosen  a  priori,  are  reviewed.  Chapter  3  details  the  problems  which 
arise  when  the  wrong  ^   function  is  chosen.  Chapter  4  presents  the  known 
techniques  for  choosing  a  ip  function,  both  a  priori  and  adaptively. 

All  these  techniques  must  eventually  perform  poorly  in  situations 
where  the  error  distribution  currently  under  study  was  not  considered  in 
simulations  for  verifying  the  performance  of  fixed  estimators.  In 
Chapter  5,  a  continuously  adaptive  M-estimator  is  introduced,  which  uses 
the  data  to  determine  the  appropriate  i)   function.  Unlike  current  adaptive 
techniques,  however,  the  choices  of  i|j  are  presented  in  a  continuum  of 
several  dimensions  rather  than  as  a  finite  set  of  choices.  This  continuum 
includes  ^p   functions  which  correspond  to  asymmetric  densities  as  well  as 
symmetric  possibilities. 

The  technique  is  shown  to  be  able  to  recover  the  information  from  a 
sample  regarding  the  shape  of  the  error  distribution  as  well  as  estimates 
of  the  regression  parameters.  Numerical  algorithms  and  examples  of  their 
use  are  presented,  as  well  as  performance  comparisons  for  least  squares. 

Chapter  6  contains  a  review  and  some  suggestions  for  future  research. 


CHAPTER  2 
ASYMPTOTIC   FOUNDATIONS  OF  M-ESTIMATION 


A  central   question  regarding  M-estimators  for  robust  regression  is 
the  asymptotic  distribution  of  the  parameter  estimate  vector.      If 
^  =  -f'/f,   then  the  M-estimate  is  a  maximum  likelihood  estimate  and  is 
known  to  have  an  asymptotically  normal    distribution   (Rao,   1973).     When 
\l)  is   chosen  independently  of  f,   as   in  an  application  where  the  true  error 
distribution   is   unknown,   the  asymptotic  distribution  of  B*,   the  vector  of 
parameter  estimates  can  also,   under  certain  regularity  conditions,   be 
shown  to  be  asymptotically  multivariate  normal.     This  result  is  most 
clearly  stated  in  Theorem  2.3  by  Polyak     and  Tsypkin   (1978). 

Huber  (1964)   introduced  the  concept  of  an  M-estimator  in  the  location 
problem     and  presented  basic  asymptotic  results.     Conditions  for  almost 
sure  convergence  of  an  estimator  based  on  ip  when  the  underlying  distribu- 
tion is  given  by  f,  with  cdf  F,  are 

(2.1)  p  must  be  strictly  convex 

(2.2)  3§o   such   that  X(§    )   exists   and  is   finite,  where 

r 


(2.3)  X(§ J    =       ^(A   -    § J    F(dA) 


0 


0 


(2.4)  3c   3  §  >  c  =^    .\(§)  <  0     and     §  <  c  ^A(§)  >  0 

If  in  addition,  the  conditions  below  are  satisfied,  then  the  estimator 
is  asymptotically  normal.   If  y*  is  the  estimate,  then 


(2.5)      /J^  (y*  -  c)  -^     n(0,V) 
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where 
(2.6) 


,J;2  (a  -  C)    F(dA) 


X'(C)' 


The  necessary  conditions  are 


(2.8) 
(2.9) 


X(c)  =  0 
x(§)  is  differentiable  at  §  =  c  and  X^(c)  <  0 

i|j2  (a  -  §)  F(dA)  is  finite  and  continuous  at  §  =  c. 


Andrews  et  al .  (1972)  present  asymptotic  variance  expressions  for  many 
M-estimators  of  location. 

A  useful  concept  for  comparing  M-estimators  of  location  is  the 
influence  curve  (Hampel,  1958).  The  influence  curve  measures  the  change 
in  the  estimate  obtained  by  making  an  infinitesimal  change  in  the  error 
distribution.  Estimators  can  be  considered  functionals  of  the  empirical 
distribution  function,  and  the  quantity  to  be  estimated  is  a  functional 
of  the  error  distribution,  represented  by  its  cumulative  distribution 
function.  For  example,  the  mean  of  a  distribution  is  a  functional  as 

shown  below 

r 


(2.10) 


T(F) 


A  F(dA) 


The  influence  curve  IC(a;T,F)  is  defined  below. 


(2.11)     IC(a;T,F)  =  lim 

e  +  o 


T((1  -  £)F  +  e-^a)  -  T(F) 


where  6  is  the  cumulative  distribution  of  a  random  variable  having  point 
mass  at  a.  It  can  be  shown  that  the  asymptotic  variance  A(F)  in  the 
location  case  is  related  to  the  influence  curve  as  shown  below  (Hampel, 


1974) 


(2.12) 


A(F) 


18 


IC(a;T,F)2  F(dA) 


M-estimates  of  location  and  scale  can  be  expressed  as  functionals 
(Andrews  et  al.,  1972)   T(F)  and  S(F)  satisfy 


(2.13) 
(2.14) 
where  x(^) 
(2.15) 


^ 


X 


f  A  -   T(F)   1 
S(F)        ^ 


F(dA)  =  0 


(^i^)F{dA)=0 


Ai|;(a)   -  p(a).     The  influence  curves  are  given  by 


IC(A;T,F) 


S(F) 


[WJ  j 


F(dy) 


•     i' 


SIFT 


(2.16)  IC(AiS,F) 


Sill 


SlFl 


SlFl 


F(dy) 


HsFTj 


Hampel    (1974)   defines  several   measures  derived  from  the  influence 
curve  which  can  be  used  to  compare  estimators.     Huber  (1981)   reviews 
these  measures  as  do  Barnett  and  Lewis   (1978).     Empirical  work  has  not 
included  references  to  these  measures,  or  compared  them  to  results 
obtained  in  Monte  Carlo  studies. 

In  regression,   influence  of  new  data  on  OLS  with  normal   error  is 

discussed  by  Belsley,   Kuh  and  Welsch    (1980).      They  show  that  the   influence 

of  the  ith  data  point  on  the  regression  parameter  estimates  3  can  be 

measured  by  differentiating  the  fitted  regression  coefficients  with  respect 

to  the  weight,  w.,   of  the  ith  observation,   and  evaluating  the  result  at 

w.   =   1.     This   differentiation   result  is   given  below: 
1 
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(2.17) 


83 


=  (X^X)"i  X.  r. 


w.=l 


Similar  results  for  M-estimators  in  regression  are  not  available.  Asymp- 
totic results  in  regression  are  derived  not  from  influence  curves,  but 

from  first  principles. 

Almost  sure  convergence  of  B*  to  3  in  the  regression  problem  was 
first  proved  by  Hijber  (1973).  This  proof  requires 


(2.18) 
where 
(2.19) 

(2.20) 
(2.21) 


ep 


2->0 


max     T^T  =  £   »  ^i]  -   (X(X'X)-iXO^i  >  ^^xp 


i,   must  be  continuous  and  bounded 


Ef(^(A))  =  0 


This  last  condition  will  be  true  when  both  f  and  p  are  symmetric  functions, 
but  if  either  p  or  f  or  both  are  not  symmetric,  this  last  condition  will 
not  hold.  Huber's  proof  does  not  include  the  special  cases  of  least  squares 
U  is  unbounded,  violating  (2.20))  or  LAV  {,   is  discontinuous,  also  violating 

(2.20)). 

Carrol  and  Ruppert  (1979)  present  a  new  proof  of  the  almost  sure  con- 
vergence of  M-estimators  in  regression,  which  does  not  require  symmetry, 
but  does  require  for  some  a,Ki,K2  >  0  that 


(2.22) 


iJj(a)  =  sign(A)K-| 


aI  <  K 


2 


aI  <  K. 


(2.23)  "Pi^)   ^   a-l^l         '"'  ^  "2 

This  condition  is  satisfied  by  few  of  the  i>   functions  in  Table  1.1 
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Ershov  (1979)   states  conditions  to  insure  that  S*  will   converge 
almost  surely  to  3.     Let  F  be  the  cumulative  distribution  function  of  A. 

Theorem  2.1    (Ershov,   1979)  Almost  sure  convergence  of  M-estimates. 
Let  p  satisfy  the  following  conditions: 
(i)       p  is  symmetric, 
(ii)     p  is  non-negative. 

(iii)  p  satisfies  a  Lipshitz  condition  of  order  1. 
(iv)     either 

(a)  p  is   convex  and   3z  3\/5  >  0,   F(z  +  5)   >     F(z  -   6),   and 
p(z  +  6)   +  p(z  -  6)   >  2p(z) 

or 

(b)  p  is  monotone  for  z  >  0  and  f  =  F'  ^  f(z-,)  <  f  (z^) , 
where  z-,  >  z^  >  0  and  3z  >  0  3  p(z  +  5)  >  p(z  -  6) 
and  f(z  +  5)  <  f(z  -  6)  for  sufficiently  small  6>0. 

Then  B*  ->  3  almost  surely,  when  B*  is  any  solution  of  (1.6). 
No  proof  of  the  theorem  was  given. 

These  conditions  on  p,  and  on  the  underlying  distribution  represented 
by  its  cumulative  distribution  function  F,  and  its  probability  density 
function  f  can  readily  be  assumed  to  hold  under  data  analytic  conditions. 
Thus  M-estimates  will  converge  almost  surely  to  the  true  parameter  vector 
regardless  of  the  relationships  of  \p   and  f.  An  immediate  corollary  involves 
the  distribution  of  the  residuals  r.  =  y^.  -  xjB*. 

Corollary  2.2  Convergence  in  distribution  of  the  residuals 

The  residuals  converge  in  distribution  to  the  distribution  of  f, 
under  the  conditions  of  Theorem  2.1. 
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Proof:     r.   =  y-   -   x.3*  ,  i  =  l,...,n   or,  in  matrix  notation, 

R  =  Y  -  XB* 

R  =  X3  +  A  -  X3*   from  Model  (1.1) 

R  -  X(3  -  3*)  +  A 

But  3*  converges  to  3  almost  surely,  so  X(3  -  3*)  converges 
to  zero  almost  surely.  A.  is  distributed  as  f,  so  r.  converges 
in  distribution  to  A.,  distributed  as  f. 

This  result  insures  that  as  the  sample  size  increases,  the  distri- 
bution of  the  residuals  will  approach  the  true  error  distribution,  regard- 
less of  the  tjj  function  used.  Anscombe  and  Tukey  (1963)  present  formulas 
for  estimating  the  moments  of  an  error  distribution  from  residuals  obtained 
using  OLS.  Andrews  (1978)  considers  the  use  of  residual  displays  when 
methods  other  than  least  squares  are  used  and  the  deficiencies  of  examining 
elementary  residual  displays  when  OLS  is  used  on  non-normal  data.  For 
small  samples  OLS  produces  residuals  which  are  more  nearly  mound  shaped 
than  the  true  error  process. 

As  an  example  of  the  inadequacy  of  OLS  residual  analysis  with  small 
samples,  and  of  the  convergence  of  OLS  residuals  to  the  true  underlying 
distribution  in  large  samples,  consider  data  as  described  by  Equation  (2.24), 

(2.24)  y  =  2  +  .4x  +  A 

where  x  is  distributed  uniformly  on  [0,1]  and  A  is  distributed  as 
GLF(-.886,  .1333,  .0193,  .1588).  The  GLF  distributions  are  described  in 
Chapter  5.  The  error  distribution  used  here  has  mean  0,  variance  1, 
skewness  1,  and  kurtosis  4.  The  true  error  density  is  shown  in  Figure  2.1. 
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Figure  2.2  illustrates  the  problem  of  investigating  simple  OLS  residual 
displays.     Fifty  observations  were  generated  according  to   (2.24)  and  fitted 
using  OLS.     A  histogram  of  the  true  errors  is  drawn  next  to  a  histogram 
of  the  least  squares   residuals.     Figure  2.3  shows  the  same  histograms 
generated  from  500  data  points.     In  Figure  2.2  the  OLS  residuals  are  more 
nearly  mound-shaped  than  the  50  points  generated  from  the  density  of 
Figure  2.1.      In  Figure  2.3,   the  large  sample  size  indicates  that  the 
asymptotic  result  of  Corollary  2.2  is  approached  for  a  sample  of  500  for 
this  distribution.     Only  with  a  large  quantity  of  data  can  such  a  display 
indicate  the  shape  of  the  underlying  distribution. 

Asymptotic  multivariate  normality  of  B*  was   first  considered  by 
Huber   (1973).     As   in  the  proof  of  almost  sure  convergence,  ep^-^0  was 
required.     No  formal    proof  was   given.      Yohai   and  Maronna   (1979)   present 
a  complete  proof  of  asymptotic  normality  using  involved  conditions  on  \p 
and  f.     The  statement  of  their  theorsnis  given  below. 

Theorem  2.2     (Yohai   and  Maronna,  1979)     Asymptotic  normality  of  M-estimators, 
The  assumptions  below  are  required  by  \p  and  f. 
Al )     i|j  is     nondecreasing. 

A2)     3b,c,d  >  0  3   D(u,z)    >  d     if   |u|    <  c     and    \z\    <  h  where  c  is  such 
that  F(c)   -   F(-c)   >  0     and  D(u,z)   =    (ti;(u  +  z)   -  '^(u))/z 


A3)      I    ijj^dF  =    V  <  oo 


A4) 


i>dF  =  0 


A5)  For  a  sample  of  size  n,  let  X  be  the  nx(l<+l)  matrix,  then 


A6) 


3n     3  Vn  >  n   ,   X'X     is  nonsingular. 


[ip(x  +  h)   -  ^{x  -  h)]2  dF  =  o(l)   as  h  ->  0     and 
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sup 

|q|  <  e,    |h|  <  e 


|h|   J 


[^(x  +  q  +  h)  -  i|;(x  +  q)]dF 


.   <   CO 


for  some  e  >  0. 


r 


A7)  3A(^,F)   3 


[^(x  +  h)  -  i|;(x)]dF  =  hA(^,F)  +  o(h) 


J.o 


A8)   lim   max    z? 
n-Ko  l<i<n 


0    where 


z^.  =  (M')'^  x^.  and  M.  is  any  (k+l)x(k+l)  matrix,  such  that 


Then 


where 


M3*  ■ 

-  MB  — >    MVN(0,t2I) 
-    D 

f  ilj^dF 
A(i|;,F)2 

While  a  rigorous  proof  of  this  theorem  exists   in  the  literature,   its   use 
is  severely  limited  by  the  difficult  to  verify  regularity  conditions. 

Ershov   (1979)   states  a  theorem  presented  by  Polyak  and  Tsypkin   (1978) 
of  the  asymptotic  normality  of  M-estimators ,   but  no  proof  has   been  given. 
The  statement  is  given  below. 

Theorem  2.3     (Polyak  and  Tsypkin,   1978)     Asymptotic  normality  of  M-estimators, 
Suppose  that  p  is  convex,   there  exists       f  =  F",   f  symmetric,     and 
3z    3  f (z)   >  0  and   V5>0,   p(z  +  5)   +  p(z  -   5)   >  2p(z),   then 

(2.25)      7n    (S*  -   e)     — ^  MVN   (0,    (X^X)"^   AVF(^,f)) 


where 


(2.26) 


AVF(iJ;,f) 
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if;^    fdA 


i/j'   fdA 


This  theorem  is  the  fundamental   result  of  the  study  of  M-estimators 
in  regression.     The  only  restrictions  on  f  are  that  f  must  be  symmetric 
and  exist  as  a  derivative  of  F.     These  assumptions  can  easily  be  accepted 
in  data  analytic  situations.     This   theorem  has   several    immediate  conse- 
quences. 

Corollary  2.4     Under  the  regularity  conditions   of  Theorem  2.3,   3*  is 
asymptotically  unbiased  and  squared  error  consistent. 
Proof:     obvious   from   (2.25) 

Harvey   (1978)   shows  that  for  g*  calculated  using  IRWLS,   if  f  is 
symmetric  and  p  is  even,   then  a  symmetric  estimator  of  the  starting  point, 
such  as  OLS,  will   not,   in  general,   lead  to  a  symmetrically  distributed  g*. 
An  asymmetric  estimator,   such  as  LAV  is  necessary  as  a  starting  point  to 
assure  symmetry  of  B*  when   using   IRWLS. 

Corollary  2.5     Asymptotic  design  criteria   depend  on    (X'X)"'^   only,   not  on 

ip  or  f. 
Proof:     obvious   from   (2.25) 

Designs   constructed  to  be  robust  against  non-normality  of  the  errors 
protect  against  small   sample  properties  of  3*.     Asymptotically,  design 
results   for  OLS  which  depend  on  criteria  relating  to  properties  of  (X"X)"\ 
for  example  D-optimality,  apply  to  M-estimators  as  well. 
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Thus,  mi  si dentifi cation  of  f  may  cause  small  sample  bias  problems, 
and  inflation  of  variance,  but  asymptotically,  0*  is  unbiased  and  converges 
almost  surely  to  3.  In  addition,  if  we  know  t>,  f,  and  X,  we  can  explicitly 
state  what  the  asymptotic  covariance  matrix  of  3*  will  be.  This  enables 
us  to  determine  how  much  data  will  be  needed  to  develop  estimates  stated 
with  the  same  variance  when  the  error  distribution  changes.  Some  distri- 
butions are  easier  to  estimate  than  others,  that  is,  they  require  less 
data  to  achieve  estimates  whose  variances  are  equivalent  to  estimates 
developed  using  more  data  from  a  more  difficult  to  estimate  distribution. 
Because  of  its  special  importance,  the  ratio  of  integrals  in  Equation  (2.26) 
will  be  called  the  asymptotic  variance  factor  (AVF),  that  is, 


(2.27) 


AVF(t|;,f)  = 


^^   fdA 


^^   fdA 


Note  that  if  iJj  =  -f'/f,   then  AVF(i|;,f)   reduces  to 


(2.28) 


AVF 


-,f 


f 


'2 


f 


dA 


1 

TTfT 


which  is  the  Cramer-Rao  lower  bound  for  the  variance,  where  1(f)  is  the 
Fisher  information  of  f.  This  result  suggests  a  minimax  approach  to 
choosing  4)   using  a  least  favorable  distribution  f,  that  is,  a  distribution 
with  minimum  Fisher  information.  The  result  is  stated  by  Ershov  (1979) 

Theorem  2.5  (Ershov,  1979)  Existence  of  minimax  estimator. 

Let  F  be  a  convex  class  of  symmetric  distributions  for  which  G  = 
{f:  1(f)  <  <»}  is  dense  in  F,  and  such  that  3f  e  F  such  that  1(f)  <  1(f) 


Vf  e   F.  Let  ^^  =  -yfo,  p^ 
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In  f  .  If  p  is  convex,  satisfies  the 

0        0 


Lipshitz  condition  of  order  1,  and  the  conditions  of  Theorem  2.3,  then 
Vf  £  F, 


(2.29)      AVF(^Q,f)  <  MH^lJ^,f^)   <  MH^Jq) 


No  proof  of  this  theorem  was  given.     The  theorem  provides  a  mini  max 
approach  to  choosing  a  i|;  function.     Examples  of  the  use  of  the  approach 
are  given  in  Chapter  4.     Huber  (1981)   considers  minimax  M-estimators  of 
scale  and  location. 


CHAPTER  3 
MISSPECIFICATION  OF  ij; 


Under  regularity  conditions  stated  in  Chapter  2,  B*  will  converge 
almost  surely  to  3,  even  in  cases  where  ij;  has  no  relation  to  the  maximum 
likelihood  '^  function  for  the  true  error  distribution.  Similarly 
/n  (p*  -  3)  converges  to  a  multivariate  normal  distribution.  The  asymp- 
totic variance  covariance  matrix  of  3*  depends  on  the  relationship  between 
the  ip   function  used  in  the  estimation  of  3*  and  f,  the  true  error  distri- 
bution. The  effects  of  misspecifying  -^j   are  illustrated  in  this  chapter 
in  terms  of  the  size  of  the  AVF  for  two  families  of  error  distributions. 

The  power  family  of  distributions  can  be  derived  as  that  family  for 
which  ^{^)  =  k-sign(A)  lAJ^'^  produces  the  maximum  likelihood  estimates. 
A  general  form  for  the  p  function  is 

P 


(3.1) 
so  that 
(3.2) 


P(A) 


f(A)  =  k  e 


-P(A) 


To  find  k  so  that  f(A)   integrates  to  1   over  the  real   line,   let 


(3.3) 
(3.4) 


AnP 


c(-) 


dA  = 


l.\  P 


^  -1 


(t) 


g 
c-p 


dz 


We  require 
(3.5) 


k  e-P^^)     dA  =   1 
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(3.6) 
(3.7) 


k(l/c)^^/P)-^   ^  z^^/P)-^  e-^/^ 


r(i/p)2-k-c'"'/P 
p 


cp 


^(l/p)-l   g-z/1 


r(i/p) 


i/P 


dz  =  1 


dz  =  1 


The  integral  in  (3.7)  is  that  of  a  Gamma  (l/p,l),  so 

J/P 


(3.8) 


^   2T(l/p)a 


This  family  includes  the  normal  (c=l/2,p=2),  and  the  double  exponential 
(c=l,p=l).  Forsythe  (1972)  introduced  the  p  function.  Box  (1978)  pre- 
sented an  alternative  parameterization.  Regression  based  on  minimizing 

n 
the  pth  power  of  the  residuals  is  appropriate  when   Y   |a-|P    can 

be  assumed  to  be  x  ,  since  if 


(3.9) 


f(A) 


2C 


1/p 


■cIa/c 


2r(l/p)a 


and  if  z  =  2c  I  A/a  I  ,  then  the  density  of  z  is 


£C 


1/P 


(3.10)    g(z)=|^p^   e 


-z/2 


.2a- 


P(2c) 


1/P 


,(l/p)-l 


(3.11)    g(z)  =  -^ 


1 


r(i/p) 


^(l/p)-l   g-z/2 


That  is,  z  ~  x' 


2/p- 


If  A-  are  iid  f,  then  z.  are  iid  x' 


2/P 


°  X  ^• 


i=l 


is  X  2n/D'  ^'^   P'  ^'  ^"'^  ^  ^^^   known,  then  a  test  of  whether  a^  are  iid  f 
is  given  by  the  rejection  region 


(3.12) 


I    |A, 


P  >  ^   x' 

2c   '^  2n/p 


For  the  power  family,  the  AVF  can  be  calculated  as  shown  below: 
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(3.13) 
(3.14) 


fdij; 


r 


J-o 


ke 


-p(A)     cp(p-l) 


p-2 


dA 


1  .   k-c-p-(p-l)    . 


P      (2c)^-2/P       aM2c)^/P 
Jl/P)-1 


a 


exp{-2/2}z 


1-2/p 


dz 


(3.15) 

(3.16) 

(3.17) 

(3.18) 
(3.19) 
(3.20) 

(3.21) 

(3.22) 


^•^•<^-iP-^)      r(i-i/p)    2^-^/P 

a(2c)'"'/P 

2k(p-1)r(1-l/p) 
-1/p 


o  Pc^/P  (p-i)r(i-i/p) 


,-l/P  e-/2 


•'0 


r(i/p)  2 


i-i/P 


dz 


r(l/p)2a  -1/p 


ac 


_  r(1-l/p)        p(p-l)  2/p 

-    r(i/p)       ^^~     ' 


i(;^fdA 


2r^2 


k  exp{-p(A)}  ^ 


2p-2 


2r,2 


2k     c^p 


p      (? 


(2c)2-2P(2c)^/P 


dA 


^-z/2   ^2-2/p   ^(l/p)-l    ^^ 


2kc"pr(2-l/p)2^""'/P 


a(2c) 


2-1/p 


•'o 


r(2-i/p)  2^"^/P 


dz 


J/P 


a        2r(l/p)o       '  ^'^    '/P^    ^ 


(3  23)  =^  C         r(2-l/p)        1/p 

^^•^^^  0^         r(l/p)  "^ 


(3.24)   m{^,f)=  ■■ 


\p^   fdA 


fdij; 


2t         r(2-l/p)      2/p 

^       r(l/p)      ^ 

r(i-i/p)     .     P(P-1)      2/p' 
r(i/p)         ^^    ' 
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(3.25) 


r(i-i/p) 


r(i/p) 


c2/P(p-i)^ 


which  reduces  to  a^  in  the  normal  case,  as  it  should. 

Suppose  that  p  is  unknown,  so  that  we  use  a  ^l)   function  from  the  power 
family  with  power  q,  rather  than  the  true  power  p.  The  AVF(i|j,f)  can  be 
calculated  using  the  steps  below. 


(3.26) 


(3.27) 


fdip  = 


k  exp{-p(A)}  ^%Ii 


q-2 


2k  c  q(q-1)     i     g 
aM2c)(q-2)/P  ■  P  ■  (2c)^/P 

,^q/p-2/p  ^1/p-l   ^^ 


dA 


exp{-z/2}. 


(3.28) 


(3.29) 


=  2kcq(q-l)a  2^^~"'^^P  r((q-l)/p) 
a^p(2c)^'^-^^/P 

^  2cq(q-l)  r((q-l)/p).p-c^/P 
pac^^'^^/''  r(l/p)2a 


^(q-1)/p  g-z/2 

r(q-i)/p  z^^-^^/P 


dz 


(3.30) 

(3.31) 
(3.32) 


r((q-l)/p)  .  q(q-l)   Jp-q+2)/p 
r(l/p)     ^^~   '^ 


^j^fdA  = 


2„2 


k  exp{-p(A)}  -^ 


2q-2 


dA 


2ki 


cy   .  (2c)(2q-2)/p  .  a  .  1 


P    (2c) 
,(2q-2)/p  ^(l/p)-l   ^^ 


1/P 


exp{-z/2}. 
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(3.33) 


(3.34) 


(3.35) 


(3.36)  AVF(i|),f) 


2kc^q^ 


op  (2c)^2q-l)/p  '  ^  P  ^  ' 

.xp{-z/2}  z((^q-^^/P^-^ 

r(2q-ix  2(2q-i)/P 
p 


£C 


1/p 


T2?^TW  2r(l)a 


p/2q-K 
ap 

^^  P  '         al    2((p-q+l)/p) 

r(l)    ^ 
p 

r(2q-1)  q2c2((p-q+l)/p) 

r(1/p)  g^ 

~1  2 


rr(^ 
r(i) 


q.(q-l) 


,(p-q+2)/p 


(3.37) 


r(2azl)  q2,2((p-q+1)/p)         r^l)  g^ 

r(Y)  q'(q-i)^  c2^^p-q^2]/p) 


r(^)  g^ 


(3.38) 


(q-l)^  c^'" 


When  p=q  this  reduces  to  (3.25).  When  p=q=2,  c=l/2  this  reduces  to  just 
g^ ,  the  least  squares,  normal  AVF. 

AVF(4^,f)  can  easily  be  calculated  for  various  combinations  of  q,  p, 
and  c.  Table  3.1  contains  values  of  AVF  for  the  power  family.  From  the 
table  we  can  see  that  if  the  error  distribution  is  normal,  and  LAV  estima- 
tion is  used  (i.e.,  c=l/2,  p=2,q=l),  the  asymptotic  variance  of  g*  is 
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Table  3.1   AVF  in  the  Power  Family 


.5 


Q 

1.0 

1.1 

1.2 

1.3 

1.4 

P 
1.5 

1.6 

1.7 

1.8 

1.9 

2.0 

1.00 

4.00 

3.28 

2.81 

2.48 

2.24 

2.05 

1.91 

1.80 

1.71 

1.63 

1.57 

1.10 

4.06 

3.24 

2.71 

2.35 

2.09 

1.89 

1.74 

1.63 

1.53 

1.46 

1.39 

1.20 

4.21 

3.27 

2.68 

2.28 

2.00 

1.80 

1.64 

1.51 

1.42 

1.34 

1.27 

1.30 

4.44 

3.37 

2.70 

2.27 

1.96 

1.74 

1.57 

1.44 

1.34 

1.25 

1.19 

1.40 

4.73 

3.51 

2.77 

2.28 

1.95 

1.71 

1.53 

1.39 

1.28 

1.19 

1.12 

1.50 

5.09 

3.69 

2.86 

2.32 

1.96 

1.70 

1.51 

1.36 

1.24 

1.15 

1.08 

1.60 

5.52 

3.92 

2.98 

2.39 

1.99 

1.71 

1.50 

1.34 

1.22 

1.12 

1.05 

1.70 

6.02 

4.18 

3.13 

2.47 

2.03 

1.73 

1.51 

1.34 

1.21 

1.11 

1.02 

1.80 

6.59 

4.49 

3.31 

2.57 

2.09 

1.76 

1.52 

1.34 

1.20 

1.10 

1.01 

1.90 

7.25 

4.85 

3.51 

2.70 

2.17 

1.81 

1.55 

1.35 

1.21 

1.09 

1.00 

2.00 

8.00 

5.25 

3.74 

2.84 

2.26 

1.86 

1.58 

1.37 

1.22 

1.10 

1.00 

75 


1.0   1.1   1.2   1.3   1.4   1.5   1.6   1.7 


.8   1.9   2.0 


1.00 

1.78 

1.57 

1.43 

1.33 

1.25 

1.20 

1.15 

1.12 

1.09 

1.07 

1.05 

1.10 

1.80 

1.55 

1.38 

1.26 

1.17 

1.10 

1.05 

1.01 

0.98 

0.95 

0.93 

1.20 

1.87 

1.57 

1.36 

1.22 

1.12 

1.05 

0.99 

0.94 

0.90 

0.87 

0.85 

1.30 

1.97 

1.61 

1.38 

1.21 

1.10 

1.01 

0.95 

0.89 

0.85 

0.82 

0.79 

1.40 

2.10 

1.68 

1.41 

1.22 

1.09 

0.99 

0.92 

0.86 

0.82 

0.78 

0.75 

1.50 

2.26 

1.77 

1.45 

1.24 

1.10 

0.99 

0.91 

0.84 

0.79 

0.75 

0.72 

1.60 

2.45 

1.87 

1.52 

1.28 

1.11 

0.99 

0.90 

0.83 

0.78 

0.73 

0.70 

1.70 

2.67 

2.00 

1.59 

1.32 

1.14 

1.01 

0.91 

0.83 

0.77 

0.72 

0.68 

1.80 

2.93 

2.15 

1.68 

1.38 

1.17 

1.03 

0.92 

0.83 

0.77 

0.72 

0.67 

1.90 

3.22 

2.32 

1.78 

1.44 

1.21 

1.05 

0.93 

0.84 

0.77 

0.71 

0.67 

2.00 

3.56 

2.51 

1.90 

1.52 

1.26 

1.08 

0.95 

0.85 

0.78 

0.72 

0.67 

c  =  1. 
Q 

0 
1.0 

1.1 

1.2 

1.3 

1.4 

P 

1.5 

1.6 

1.7 

1.8 

1.9 

2.0 

1.00 

1.00 

0.93 

0.88 

0.85 

0.83 

0.81 

0.80 

0.80 

0.79 

0.79 

0.79 

1.10 

1.01 

0.92 

0.85 

0.81 

0.78 

0.75 

0.73 

0.72 

0.71 

0.70 

0.70 

1.20 

1.05 

0.93 

0.84 

0.79 

0.74 

0.71 

0.69 

0.67 

0.66 

0.64 

0.64 

1.30 

1.11 

0.95 

0.85 

0.78 

0.73 

0.69 

0.66 

0.64 

0.62 

0.60 

0.59 

1.40 

1.18 

0,99 

0.87 

0.79 

0.72 

0.68 

0.64 

0.61 

0.59 

0.58 

0.56 

1.50 

1.27 

1.05 

0.90 

0.80 

0.73 

0.67 

0.63 

0.60 

0.58 

0.56 

0.54 

1.60 

1.38 

1.11 

0.94 

0.82 

0.74 

0.68 

0.63 

0.59 

0.57 

0.54 

0.52 

1.70 

1.50 

1.19 

0.99 

0.85 

0.76 

0.69 

0.63 

0.59 

0.56 

0.53 

0.51 

1.80 

1.65 

1.27 

1.04 

0.89 

0.78 

0.70 

0.64 

0.59 

0.56 

0.53 

0.51 

1.90 

1.81 

1.37 

1.10 

0.93 

0.81 

0.72 

0.65 

0.60 

0.56 

0.53 

0.50 

2.00 

2.00 

1.49 

1.18 

0.98 

0.84 

0.74 

0.66 

0.61 

0.56 

0.53 

0.50 
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inflated  by  a  factor  of  1.57.  If  the  error  distribution  is  actually 
double  exponential,  and  least  squares  error  is  used  (i.e.,  c=l ,  p=l ,  q=2), 
then  the  factor  is  2.00.  If  the  error  distribution  is  equally  likely  to 
be  normal  or  double  exponential,  the  best  compromise  value  for  q  is  1.35. 
This  q  gives  the  smallest  average  AVF  value  over  p=l  and  p=2. 

Table  3.2  contains  values  of  AVF  relative  to  the  best  alternative 
value  for  q  in  the  range  1-2,  that  is,  the  minimum  AVF  will  be  obtained 
when  q=p  by  the  maximum  likelihood  properties  of  ^.     If  each  value  of  AVF 
for  a  given  value  of  p  is  divided  by  AVF  for  p=q,  we  obtain  a  measure  of 
the  inflation  in  the  asymptotic  variance  caused  by  misspecification  of  the 
power  during  estimation.  For  both  the  normal  and  double  exponential  dis- 
tributions, when  p=q,  AVF=1  so  this  standardization  does  not  affect  the 
results  presented  in  the  previous  paragraph.  The  value  of  c  drops  out  of 
the  expression  for  the  ratio  expressed  in  Table  3.2.  Note  that  the  largest 
variance  inflation  occurs  when  using  least  squares,  and  the  true  error 
distribution  is  double  exponential.  The  best  average  inflation  over  the 
range  of  p  values  from  1  to  2  occurs  when  q=1.4,  the  worst  when  q=l ,  or 
q=2.  These  results  are  presented  in  the  form  of  contour  plots  in  the 
Appendix. 

For  the  power  family,  asymptotic  variance  considerations  indicate  a 
choice  for  p  in  fitting  of  1.4  or  1.5.  The  \p   functions  for  various  choices 
of  p  are  chosen  in  Figure  3.1. 

A  second  family  of  distributions  which  can  be  studied  for  the 
effects  of  misspecification  of  i|j  is  the  generalized  Lambda  Family  (GLF), 
introduced  by  Ramberg  et  al .  (1979).  The  density  function  has  four  para- 
meters and  is  represented  as  a  function  of  a  percentile  from  0  to  1 . 
Given  a  percentile,  say  p,  then  the  value  of  f  and  A  can  be  found  by  using 
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Table  3.2   Relative  AVF  inflation  in  the  Power  Family 


q   1.0   1.1   1.2   1.3   1.4   1.5   1.6   1.7   1.8   1.9   2.0 

1.00  1.00  1.01   1.05  1.09  1.15  1.21   1.28  1.34  1.42  1.49  1.57 
1.10  1.01   1.00  1.01   1.04  1.07  1.11   1.16  1.22  1.27  1.33  1.39 


1.20 
1.30 
1.40 
1.50 
1.60 
1.70 
1.80 
1.90 


1.05 
1.11 
1.18 
1.27 
1.38 
1.50 
1.65 
1.81 


1.01 
1.04 
1.08 
1.14 
1.21 
1.29 
1.39 
1.50 


1.00 
1.01 
1.03 
1.07 
1.11 
1.17 
1.23 
1.31 


1.01 
1.00 
1.01 
1.03 
1.05 
1.09 
1.14 
1.19 


1.03 
1.01 
1.00 
1.01 
1.02 
1,04 
1.08 
1.11 


1.06 
1.02 
1.01 
1.00 
1.00 
1.02 
1.04 
1.06 


1.09 
1.05 
1.02 
1.00 
1.00 
1.00 
1.01 
1.03 


1.13 
1.08 
1.04 
1.02 


00 
00 


1.00 
1.01 


1.18 
1.11 
1.05 
1.03 
1.01 
1.00 
1.00 
1.00 


1.22 
1.15 
1.09 
1.05 
1.03 
1.01 
1.00 
1.00 


1.27 
1.19 
1.12 
1.08 
1.05 
1.02 
1.01 
1.00 


2.00  2.00  1.62  1.39  1.25  1.16  1.10  1.05  1.03  1.01   1.00  1.00 
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the  equations  shown  below: 


(3.39)         f(p)  = 


[xVs-i  +  .  (1  .  p)M-l3 


(3.40)         A(p)  -  X^  ^   ^P'^  -  ^l  -  P^'^^ 


For  densities  in  the  GLF  to  have  a  zero  expectation, 


-1/X2     1/X2 


(3.41)         X,  =^-^    + 


1   1+^3     1+^4 


Properties  of  the  GLF,  and  its  relation  to  other  densities  are  presented 
in  Chapter  5. 

To  examine  misspecification,  attention  must  be  restricted  to  the 
densities  in  the  GLF  which  satisfy  the  asymptotic  properties  of  Chapter  2. 
Equation  (3.41)  specifies  X-,  and  X^  must  equal  x^  to  insure  symmetry.  This 
leaves  x^  and  x^  unspecified,  and  in  practice,  unknown.  To  fit  an 
M-estimate  using  a  yp   function  corresponding  to  an  MLE  estimator  for  a 
density  in  the  GLF,  one  needs  to  specify  X^  and  X^.  Figure  3.2  shows 
functions  for  various  choices  of  X^  and  X^.  For  X„  =  .1974  and  X^  =  .1341, 
the  GLF  distribution  closely  approximates  the  standard  normal  distribution, 
and  iJj(a)  -  A. 

Table  3.3  gives  values  of  AVF  for  combinations  of  values  of  X2  and  X^ 
in  a  chosen  ^   function  (rows  of  Table  3.3)  versus  values  of  x^  and  x^ 
in  the  true  error  desnity.  Symmetry  is  achieved  by  setting  x^  equal  to  X^ 
to  correspond  to  a  class  of  densities  for  which  the  results  of  Chapter  2 
will  apply. 
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The  tables  indicate  that  mi sspecifi cation  variance  inflation  is  more 
highly  dependent  on  X^  than  on  X,.  No  one  value  of  (X^.X^)  will  protect 
against  high  variance  throughout  the  parameter  space.  Least  squares  esti- 
mation corresponds  to  (X2,X2)  =  (.19, .13). 


CHAPTER  4 
CHOOSING  A  ^  FUNCTION 

Empirical  Studies 

The  largest  amount  of  research  in  the  use  of  M-estimators  has  been 
empirical  in  nature.  Recommendations  for  i)   functions  are  based  on  an 
evaluation  of  the  performance  of  the  i)   function  against  other  proposals 
for  a  variety  of  error  distributions  and  performance  criteria.  Several 
studies  have  appeared  in  which  a  ii   function  was  proposed  and  then  simulated 
against  various  alternatives.  The  largest  such  study  was  done  for  M-esti- 
mators of  location.  The  Princeton  Robustness  Study  (Andrews  et  al.,  1972) 
involved  sixty-five  estimators  (of  which  eighteen  were  M-estimators), 
fourteen  error  distributions,  and  eighteen  performance  criteria.  No  con- 
census was  reached  by  the  authors  as  to  the  superiority  of  a  single 
M-estimator,  but  this  may  be  due  to  the  stated  goal  of  the  study  to  find 
estimators  which  were  95%  efficient  in  the  normal  case  relative  to  the 
sample  mean  and  performed  well  against  all  other  alternative  error  distri- 
butions. No  such  estimator  was  found.  Stigler  (1977)  compared  seven  robust 
estimators  of  location  (of  which  five  were  M-estimators)  on  historical  data 
sets.  All  estimators  performed  well,  but  the  data  sets  used  did  not  contain 
outliers,  or  other  indications  of  non-normality. 

Andrews  (1974)  gives  the  first  empirically  reasoned  presentation  of  a 
ijj  function  for  regression  parameter  estimation.  Andrew's  \p,  the  so-called 
SINE  function: 
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(4.1) 


^(A)  = 


Sin  (f) 


0 


Ia|  <  OT 


A   >  CTT 


removes  from  consideration  any  residual  which  exceeds  ctt.  In  this  way, 
outliers  are  rejected  automatically.  Andrews  demonstrates  how  this  ^J^ 
function  finds  the  four  outliers  found  by  Daniel  and  Wood  (1971)  in  the 
stack  loss  data  (Brownlee,  1965).  The  constant  c  is  chosen  by  the  prac- 
titioner to  control  the  amount  of  trimming.  Justification  for  the  method 
is  presented  in  terms  of  the  stability  of  the  moments  of  '|^  under  five 
alternative  distributions,  its  ability  to  locate  outliers  and  remove  them, 
and  its  functional  simplicity  relative  to  the  Hampel  estimators  (Andrews 
et  al.,  1972). 

A  second  estimator,  the  bi-square  estimator,  is  proposed  by  Gross  (1976, 
1977).   Its  ^   function  is  given  below: 


(4.2) 


^(A) 


2  \2 


A(l  -  A^) 


0 


|A|  <  1 


|A|  >  1 


This  estimator  was  used  in  an  application  by  Beaton  and  Tukey  (1974). 
Gross  (1976)  compares  this  i)   in  the  location  problem  with  twenty-four  other 
estimators  and  seven  underlying  distributions  in  a  simulation  study  to 
determine  which  yields  confidence  intervals  which  are  true. 

Gross  (1977)  considers  the  bi-square  estimator  for  regression  under 
different  design  conditions,  sample  size,  and  error  distribution.  No 
comparisons  were  made  with  other  M-estimators. 
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Huber  (1964)  proposes  a  ^   function  which  is  the  MLE  for  a  normal 
distribution  with  exponential  tails.  The  estimator,  known  as  Huber  Pro- 
posal 2,  has  the  following  i|j  function: 


(4.3) 


'J^(A) 


k       A  >  k 

A      |A|  <  k 

■k      A  <  -k 


The  value  of  k  is  chosen  by  the  user.  Many  k  values  were  tried  in  the 
Princeton  robustness  study  in  evaluating  the  effectiveness  of  the  function 
for  estimating  location.  Denby  and  Mallows  (1977)  have  developed  graphical 
techniques  to  aid  in  the  choice  of  k  in  the  regression  problem. 

Two  large  Monte  Carlo  studies  comparing  4;  functions  in  regression  have 
appeared.  Denby  and  Larsen  (1977)  compare  twelve  estimators  (of  which  four 
are  M-estimators)  under  twenty-four  conditions  (three  underlying  error 
distributions  and  five  other  factors  either  present  or  absent,  using  a 
quarter  replicate  of  the  3x2^  combinations  of  factors)  measuring  effective- 
ness by  comparing  mean  square  error.  Andrew's  SINE  estimator  was  found 
to  be  superior  to  the  other  M-estimators  studied. 

Ramsey  (1977)  studied  the  L(p)  estimators  introduced  by  Forsythe  (1972), 
Their  ^   function  is 


(4.4) 


li^(A)  =  sign(A).  lAlP""* 


This  family  of  estimators,  indexed  by  p,  includes  ordinary  least  squares 
(OLS),  when  p=2,  and  LAV  when  p=l .  Asymptotic  variance  effects  when  p  is 
misspecified  in  fitting  relative  to  a  true  value  of  p  for  the  underlying 
error  distribution  are  studied  in  Chapter  3.  Ramsey  compared  these  esti- 


mators to  SINE  and  E  estimators.  The  '];  function  for  an  E,  estimator  is 
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(4.5)  ^(A)  =  A-e"^'^l 

Contaminated  normal  distributions  of  various  degrees  of  contamination  were 
used  as  the  error  distributions.  Several  other  factors  were  studied.  SINE 
and  E   estimators  were  found  to  be  superior  in  efficiency  compared  to 
L(p)  estimators  over  a  wide  range  of  these  conditions. 

Several  smaller  studies  have  measured  the  performance  of  M-estimators 
empirically.  Coffmann  and  Whiteside  (1980)  compare  OLS  and  median  regression 
(Sen,  1968).  Dodge  and  Lindstrom  (1981)  study  convex  combinations  of  LAV 
and  OLS.  Hill  and  Holland  (1977)  compare  LAV  and  SINE  when  the  errors  are 
contaminated  normal  and  prefer  SINE.  Hughes  (1980)  compares  LAV  and  OLS 
on  experimental  data  systematically  subdivided  into  replicates.  LAV  had 
smaller  variance  in  its  estimates  and  its  estimates  were  more  normal  than 
OLS. 

A  few  other  studies  have  used  M-estimates  on  data  produced  by  real 
life  phenomena.  Beaton  and  Tukey  (1974)  use  a  bi-square  estimator  on  band 
spectroscopic  data.  Agee  and  Turner  (1978)  use  a  Hampel  i>   function  to 
detect  outliers  in  trajectory  data  at  White  Sands  missile  range.  Rocke  and 
Downs  (1980)  compare  trimmed  means,  Huber  i|j,  and  bi-square  estimators  of 
location  on  chemistry  data  expanding  the  work  of  Stigler  (1973)  who  compared 
robust  estimates  of  location  as  mentioned  previously.  Lenth  (1980a,  1980b) 
has  used  M-estimators  of  location  on  directional  data  in  problems  of  air- 
craft location. 

Empirical  studies  of  regression  estimators  have  not  matched  the  Prince- 
ton Robustness  Study  of  estimators  of  location  for  thoroughness  either  in 
the  estimators,  the  alternative  error  distributions,  or  the  number  of 
criteria  with  which  the  estimators  can  be  compared.  Many  other  M-estimators 
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exist  which  have  never  been  tested  empirically,  such  as  several  of  the 
minimax  estimators  described  later  in  this  chapter. 

Questions  regarding  the  performance  of  M-estimators  have  also  been 
handled  empirically.  Carrol  (1979)  used  the  jackknife  technique  (Schucany, 
1979)  to  estimate  the  variances  of  regression  parameter  estimators  when 
the  underlying  distribution  is  asymmetric.  Hill  (1979)  followed  Huber  (1973) 
in  comparing  numerically  alternative  estimators  for  the  variance-covariance 
matrix  of  the  regression  parameter  estimators.  Welsch  (1975)  used  a  Monte 
Carlo  study  to  find  the  lengths  of  confidence  intervals  for  M-estimators. 
Ypelar  and  Velleman  (1977)  compare  the  effects  of  high  leverage  points  on 
four  M-estimators:  Bi-square,  Fair,  Dennis'  and  Welsch,  and  Huber. 

Empirical  studies  provide  much  needed  experience  in  the  use  of  M-esti- 
mators, but  have  not  provided  general  conclusions  that  would  indicate  the 
use  of  a  particular  ii   function.  A  practitioner  would  need  to  assume  that 
the  data  used  in  these  studies,  whether  it  was  historical  or  generated 
using  pseudo  random  numbers,  was  similar  in  distribution  to  the  data  to 
be  fitted  so  that  conclusions  derived  from  the  empirical  work  could  be 
applied.  This  assumption  is  still  quite  large  given  the  currently  available 
empirical  work. 

Asymptotic  Considerations 
A  second  approach  to  choosing  a  ^   function  would  be  to  use  an  asymp- 
totic argument.   In  Chapter  2  proofs  were  referenced  which  indicate  that 
for  a  given  error  distribution,  all  ij;  functions  will  produce  estimates 
which  converge  almost  surely  to  the  true  regression  parameter  values. 
Differences  between  the  ^   functions  involve  the  asymptotic  variance  of 
the  estimates,  which  is  proportional  to  the  AVF  for  the  it   in  use  and  the 
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true  error  density.  Recall  that 


(4.6) 


i|>2fdA 


MH^,f)    = 


li^'fdA 


A  ^   function  could  be  chosen  to  provide  low  values  of  AVF  for  a  class  of 
possible  error  distributions.  Specifically,  ^   might  be  chosen  to  minimize 
the  average  relative  AVF,  or  the  maximum  relative  AVF. 

In  Chapter  3,  the  AVF  of  the  power  family  was  presented  for  various 
alternatives  of  '^  and  f  within  that  family.  If  the  error  distribution 
can  be  assumed  to  be  from  the  power  family,  and  one  wishes  to  use  an  L(p) 
estimator,  the  best  choice  of  power  for  ijj  is  .4,  since  that  value  minimizes 
the  average  relative  AVF  across  the  range  of  powers  for  the  error  density. 
The  value  .4  minimizes  the  maximum  relative  AVF  at  1.18  (see  Table  3.2). 

The  AVF  of  the  symmetric  GLF  was  also  presented  in  Chapter  3.  This 
is  a  much  broader  class  of  distributions  than  for  the  power  family.  No 
one  value  of  (A^jA^)  could  be  found  which  would  protect  well  against  all 
alternative  pairs. 

AVF  values  can  be  computed  analytically  for  many  combinations  of  i|) 
and  f,  not  just  combinations  within  a  parametric  family  of  distributions. 
The  AVF  values  indicate  the  sample  sizes  needed  to  get  estimates  with 
equal  asymptotic  precision.  The  bound  on  an  estimate  of  B^,   the  kth  element 
of  3  is 


(4.7) 


B 


C(X'X)-1  -AVF 


To  achieve  equal  bounds  using  two  different  ip   functions  would  require 
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AVF, 


AVF. 


(4.8) 


(4.9) 


^ 

v/n2 

"1  _ 

"2 

AVF^ 
AVF| 

For  example,  in  the  power  family,  if  the  true  power  is  .2  and  100  points 
are  used  to  estimate  B  with  ii   power  of  .4,  then  182  points  would  be  needed 
to  estimate  6  with  the  same  precision  using  least  squares. 

Comparing  AVF  values  will  only  be  feasible  for  those  distributions 
and  111   functions  satisfying  the  regularity  conditions  for  the  existence  of 
the  asymptotic  normality  of  6*,  that  is 

1 .  f  symmetric. 

2.  p  must  be  convex 

3.  f  must  be  a  continuous  derivative  of  F. 

4.  3z  3  f(z  )  >  0  and  p(z„  +  6)  +  p(z  -  6)  >  2p(z  )  V5  >  0. 

0      0  0  u  " 

Condition  1  above  is  a  serious  limitation  in  the  application  of  asymp- 
totic variance  techniques  to  experimental  data  situations  which  may  be 
known  to  have  asymmetric  behavior.  Asymptotic  behavior  for  these  i) 
functions  has  not  been  established. 

For  some  ^   functions,  establishing  AVF  can  be  done  analytically.  For 

least  squares, 


(4.10) 


iJ;^fdA  = 


A^fdA  =  Var(A) 


(4.11) 


fdil)  = 


fdA  =  1 


so  AVF(OLS,f)  =  Var(A)  for  any  distribution  f. 
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For  LAV, 
(4.12) 

(4.13) 


4;^fdA 


J 


fdA  =  1 


fdii;  =  2f(o) 


SO  AVF(LAV,f)  =  l/(4f(o)^)  for  any  distribution  f.  Thus  LAV  will  outper- 
form OLS  in  asymptotic  variance  for  any  error  distribution  for  which 


(4.14) 


1 


4f(o)- 


<  Var(A) 


Asymptotic  variance  expressions  provide  an  analytic  means  for  comparing 
Tp   functions,  but  are  severely  limited  in  the  choices  of  i|)  and  f  which  can 
meet  the  regularity  conditions  sufficient  to  insure  an  AVF  of  the  form  in 
(4.6). 

Other  asymptotic  behavior  measures  can  be  computed  from  the  influence 
curve  of  an  estimator.  Influence  curves  for  M-estimators  of  location  are 
given  by  Equation  (2.13).  Hampel  (1974)  defines  six  additional  optimality 
criteria  for  estimators  in  terms  of  the  influence  curve.   In  addition,  a 
table  is  presented  in  which  seventeen  estimators  are  compared  on  eight 
criteria  at  the  normal  distribution. 

Minimax  Estimators 
For  distributions  in  classes  which  satisfy  the  regularity  conditions 
of  Theorem  2.7,  minimax  estimators  exist.  Huber  (1964)  introduced  the 
concept  of  a  minimax  M-estimator  of  location.  Polyak  and  Tsypkin  (1978) 
extend  the  notion  to  M-estimators  of  regression  parameters.  When  F  is  a 
class  of  distributions  satisfying  the  conditions  of  Theorem  2.7,  f^  e  F  is 
the  distribution  with  minimum  Fisher  information,  and  ijj  =  -'^^/'^q->   then 
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(4.15)  AVF(i|;Q,f)  <  MH^Q,fQ)  Vf  e   F 


Thus  i)    will  provide  a  minimax  estimator  of  6.  So  for  certain  families,  f 
protection  is  afforded  by  using  the  ^   which  is  MLE  for  the  least  infor- 
mative distribution  in  F.  Ershov  (1979)  gives  several  examples  of  families 
and  their  resulting  minimax  estimators  (see  Table  4.1  below). 

Little  work  has  been  done  in  applying  these  estimators  except  in  cases 
where  they  correspond  to  estimators  developed  using  other  criteria.  Classes 
F  with  more  specific  restrictions  should  yield  interesting  choices  for  ip. 
Minimax  estimators  need  to  be  refined.  Classes  of  distributions  can  be 
more  completely  specified  than  has  been  done  in  the  past.  Using  calculus 
of  variation  for  finding  new  -^   functions  should  yield  better  estimators 
the  more  completely  one  specifies  the  class.  Error  distributions  which 
are  feasible  for  experimental  data  can  be  well  specified  as  a  class.  These 
new  estimators  need  to  be  compared  to  previously  proposed  and  numerically 
tested  estimators.  Conservative  confidence  bounds  imposed  using  the  mini- 
max principle  need  to  be  studied.  Small  sample  properties  of  many  minimax 
estimators  are  unknown. 

Table  4.1   Classes  of  distributions  and  their 
minimax  estimator 

Class  Estimator 

All  distributions  with  f(o)  >  0  LAV 

All  distributions  with  Var(A)  <  °°  OLS 

All  contaminated  normal  distributions  Huber  Proposal  2 

All  distributions  on  [-a, a]  4^(A)  =  tan(TTA/2a) 

Minimax  considerations  can  only  be  applied  within  the  classes  of  distri- 
butions which  satisfy  the  conditions  of  Theorem  2.8,  the  most  serious 
limitation  of  which  is  symmetry. 
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Adaptive  Estimators 

In  the  three  approaches  to  choosing  a  ii   function  presented  in  this 
chapter  so  far,  the  selection  of  \p   is  made  a  priori  to  data  analysis.  To 
avoid  such  a  possibly  arbitrary  decision,  one  may  wish  to  use  a  procedure 
which  leaves  4^,  and  thus  f,  partially  unspecified  at  the  beginning  of 
fitting.  This  would  allow  the  data  to  adapt  the  i)   function  to  one  which 
fits  the  data  best.  Two  possible  methods  for  achieving  this  exist.  The 
first  method  is  to  use  a  preliminary  fit  of  the  data  to  indicate  which  ij; 
function  is  to  be  used  in  subsequent  fitting.  A  study  using  such  a  tech- 
nique is  described  below. 

A  second  technique  involves  using  a  ^   function  which  contains  unknown 
parameters.  Andrew's  SINE  function  contains  a  parameter  c,  which  could  be 
fitted  using  a  maximum  likelihood  procedure  along  with  regression  parameters. 
This  technique  was  used  in  the  Princeton  Robustness  Study  of  location  esti- 
mators. Chapter  5  contains  the  results  of  applying  the  technique  in 
regression. 

Using  a  preliminary  fit  in  order  to  choose  a  ijj  function  is  described 
by  Moberg  et  al.  (1980),  who  use  LAV  to  develop  residuals,  then  calculate 
Qo  and  Q.  as  defined  below. 

(4.16)  Q    U(.05)  -M(.50) 

^       M(.50)  -  L(.50) 

(4.17)  U(.05)-L(.05) 

U(.50)  -  L(.50) 

where  L(y),  U(y),  M(y)  are  the  arithmetic  means  of  the  smallest,  largest, 
and  middle  n  of  the  ordered  residuals,  respectively.  The  location  of 
(Q,,Q-)  in  a  plane  divided  into  five  regions  is  then  determined  in  order 
to  select  the  ^j  function  which  corresponds  to  the  region  in  which  (03,0^) 
lies. 
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The  performance  of  this  adaptive  procedure  was  studied  by  comparison 
to  OLS,  LAV,  SINE,  and  a  one-step  SINE  estimator.  Ten  error  distributions 
were  used,  and  the  measure  of  performance  was  a  median  efficiency  ratio 
(MER)  based  on  total  squared  error  (TSE),  where 


(4.18)  TSE  =     I     {&*,  -   &.y 


j=l  ^ 


(4  ^g)        fviFR  =  median  (TSE  of  OLS) 

^  ■  ■'  median  (TSE  of  adaptive  procedure) 


The  adaptive  procedure  outperformed  all  competitors  except  the  case  of 
least  squares  with  normal  errors.  The  median  efficiency  ratio  of  the 
adaptive  procedure  ranged  from  .85  to  47.197. 


CHAPTER  5 
CONTINUOUSLY  ADAPTIVE  M-ESTIMATORS 


In  Chapter  4  several  techniques  for  choosing  a  '|i  function  were  pre- 
sented. Each  technique  included  choosing  a  ^   function  either  from  a 
limited  set  of  functions  based  on  a  calculation  from  a  preliminary  fit, 
or  a  priori  according  to  a  previously  established  performance  criteria. 
In  this  chapter  a  new  technique  is  introduced  which  does  not  involve 
choosing  a  ^   function. 

Continuously  adaptive  M-estimation  is  a  method  of  simultaneously 
fitting  a  regression  model,  as  in  Equation  (1.1)  and  estimating  parameters 
of  the  error  distribution.   In  OLS,  one  parameter,  a^,  of  the  error  distri- 
bution is  estimated.  The  basic  shape  of  the  error  distribution  is  fixed, 
as  in  any  M-estimation.  Table  1.1  gives  the  correspondence  between  ijj 
functions  and  the  distributional  assumptions  they  represent.  One  would 
prefer  to  use  a  regression  fitting  method  which  did  not  presuppose  a  shape 
for  the  error  density.  Existing  adaptive  methods  (Hogg,  1974)  fix  a  finite 
set  of  alternative  error  distributions.  The  new  technique  will  allow  the 
error  distribution  to  be  any  member  of  a  family  of  distributions  which  can 
be  written  using  one  functional  form. 

A  family  of  distributions  which  would  be  suitable  for  a  continuously 
adaptive  M-estimator  must  have  the  properties  listed  below. 

1.  It  has  a  single  functional  form. 

2.  It  covers  a  wide  range  of  skewness  and  kurtosis  values. 

3.  It  has  known  distributional  properties. 
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The  first  requirement  guarantees  that  an  algorithm  can  be  produced 
to  fit  the  parameters,  and  is  a  natural  assumption  for  the  experimenter. 
Distributional  families  such  as  the  Pearson  curves  (Solomon  and  Stephens, 
1978),  Johnson  curves  (Johnson,  1949),  and  Burr  distributions  (Burr,  1942) 
are  not  appropriate,  as  a  member  of  the  curve  family  must  be  chosen  prior 
to  estimation. 

The  second  requirement  insures  that  the  error  process  can  be  adequately 
modeled  through  its  first  four  moments.  Pearson,  Johnson,  and  Burr  (1979) 
compare  distributions  having  the  same  first  four  moments.  They  are  encour- 
aged about  the  similarity  of  the  percentiles  of  such  distributions,  but 
warn  about  tail  behavior. 

The  third  requirement  is  an  obvious  prerequisite  to  drawing  conclusions 
from  such  a  fit.  To  make  confidence  intervals  and  detect  outliers  using 
probability  statements,  we  need  to  have  knowledge  of  distributional  properties. 

Tadikamalla  (1980)  reviews  six  available  techniques  for  generating  non- 
normal  variates  using  a  computer.  The  generalized  lambda  family  of  distri- 
butions (Ramberg  et  al . ,  1979)  and  the  Fleishman  proposal   (Fleishman,  1978) 
have  a  single  functional  form  and  cover  a  wide  range  of  skewness  and  kurtosis 
values.  Fleishman  suggests  generating  non-normal  variates  using 

(5.1)  A  =  a  +  bz  +  cz^  +  dz^ 

where  z  is  a  normal  random  variate  with  mean  0  and  variance  1.  Tables  are 
presented  relating  the  parameters  a,  b,  c,  and  d  to  the  first  four  moments 
of  the  distribution.  No  analytic  results  are  available.  No  tail  character- 
istics are  known. 

Ramberg  et  al .  (1979)  present  a  distribution  parameterized  by  its  per- 
centiles. This  four  parameter  family  is  given  by 
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^3  .  n  .  n/4 


(5.2)  A  =  X^  +  P   -  {^  -  P) 


(5.3)  f(A)  =       2 


X3P^3-1  ,  ,^(1  .  p)M-l 


where  p  is  between  0  and  1.  Random  variables  are  generated  by  first 
generating  p  as  uniform  random  variate,  and  then  using  (5.2).  Density 
curves,  such  as  Figure  2.1,  can  be  drawn  by  varying  p  from  0  to  1  and  for 
each  p,  calculating  A  and  f(A)  from  (5.2)  and  (5.3)  and  plotting  the  pair. 
The  properties  of  this  family  are  covered  in  depth  later  in  this  chapter. 

Having  chosen  a  suitable  parametric  representation  for  the  error  dis- 
tribution, a  method  of  estimating  the  parameters  must  be  selected.  M-esti- 
mation  involves  estimating  the  regression  parameters  from  Equations  (1.6). 
For  given  X,,  X„,  X-,  and  X.,  i|;(A)  could  be  calculated  from  (5.3)  and  used 
in  (1.6).  When  X-,,  X„,  X-,  and  X.  are  unknown,  as  in  a  data  fitting  prob- 
lem, simultaneous  estimation  of  3  and  X  =  (X,  ,X„,X-,X.)  is  necessary. 
Difficulties  in  estimating  6  and  a^  for  nonhomogeneous  ^   were  referenced 
in  Chapter  1.  One  method  of  estimating  the  parameters  is  maximum  likelihood 
estimation.  One  needs  to  maximize  the  likelihood  function 

n 

(5.4)  L  =  n  f(y  -  x^B;X) 

i=l    ^   ~^~  ~ 

over  6  and  X.  Using  the  maximum  likelihood  theory,  one  can  obtain  confidence 
intervals  for  6  and  X,  and  covariance  estimates.  Maximum  likelihood  esti- 
mation is  known  to  be  sensitive  to  departures  from  the  assumption  that  the 
likelihood  is  of  the  form  in  (5.4)  (Huber,  1965).   In  data  analytic  situa- 
tions, the  assumptions  implicit  in  (5.4)  are  quite  reasonable,  including 
many  possible  error  distribution  alternatives.  The  assumption  of  a  linear 
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model  is  not  needed  in  the  development  which  follows.  If  y  =  m(x^-'2)  +  A 
is  the  assumed  model,  the  likelihood  can  be  written 

n 

(5.5)  L  =  n   f(y.  -  m(x.,B);A) 

and  maximized  over  6  and  X.  An  example  of  a  nonlinear  fit  is  given  in  this 
chapter. 

Properties  of  the  generalized  lambda  family  are  described  in  the  next 
section.  General  applicability  of  maximum  likelihood  estimation  and  conse- 
quences follow.  Algorithms  for  calculating  (5.4)  are  presented  followed  by 
examples  of  the  procedure  being  used  to  fit  a  variety  of  different  problems. 

The  Generalized  Lambda  Family 
Ramberg  et  al.  (1979)  introduced  the  generalized  lambda  family  (GLF) 
as  a  generalization  of  the  lambda  family  proposed  by  Tukey  (1950).  Equations 
(5.2)  and  (5.3)  define  the  density  which  does  not  exist  in  closed  form. 
Expressions  for  the  first  four  central  moments  are  reproduced  below. 

(5.6)  U  =  X^  +  A/A2 

(5.7)  a^  =  CB  -  k')/\\ 

(5.8)  1^3  =  (C  -  3AB  +  2h^)/\\ 

(5.9)  P4  =  (D  -  4AC  +  SA^B  -  3A^)/X5 

where 

(5.10)  A  =  1/(1  +  X3)  -  1/(1  +  X4) 

(5.11)  B  =  1/(1  +  2X3)  -  2  BETA(1  +  X3,l  +  X^)  +  1/(1  +  2\q) 
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(5.12)  C  =   1/(1   +  3X3)   -   3  BETAd   +  2X^,^   +  X^) 

+  3  BETAd   +  ^3,1   +  2X3)   -   1/(1   +  3X4) 

(5.13)  D  =   1/(1    +  4X3)   -   4  BETAd    +  3X3,1   +  X^) 

+  6  BETAd   +  2X3,1   +  2X4)   -   4  BETAd   +  ^3,1+  3X4) 

+  vd  +  4x4) 

and  BETA(a,b)  =  r(a)r(b)/r(a  +  b). 

These  equations  imply  that  to  use  GLF  distributions  with  mean  zero,  we 

must  have 

-1/X,     1/X, 

(5.14)  X,  =  ^  +  ?— 

1  +  X3     1  +  X4 

The  skewness  03  =  1-13/03  and  kurtosis  a.  =  y^/a"*  are  functions  of  X3  and  X. 
alone. 

Lam,  Bowman,  and  Shenton  (1980)  present  contour  plots  of  a3  and  a, 
for  axis  of  y  and  6,  where  y  =  X^  and  6  =  ^3/^4-  When  5  =  1,  the  distri- 
bution is  symmetric.  For  the  original  parameterization,  the  skewness  and 
kurtosis  contours  are  given  in  Figures  5.2  through  5.5.  Figure  5.1  shows 
the  area  of  the  (a3,a4)  plane  covered  by  GLF  distributions. 

By  matching  the  first  four  moments  of  a  distribution  using  the  tables 
of  Ramberg  et  al.  (1979),  one  can  get  a  X  vector  which  will  produce  a  GLF 
distribution  which  will  be  close  to  the  desired  distribution.  In  Table  5.1 
several  distributions  are  given  with  their  GLF  approximations. 

In  a  similar  manner,  various  \p   functions  can  be  approximated  by  matching 
the  first  four  moments  of  their  corresponding  maximum  likelihood  densities 
from  Table  1.1  to  a  density  from  the  GLF.  The  ^i   function  for  a  density  from 
the  GLF  can  be  evaluated  numerically  using  Equation  (1.7).  Table  5.2  gives 
several  GLF  approximations  to  ip   functions  from  Table  1.1. 
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Figure  5.1  Coverage  of  (a^sa.)  by  GLF  distributions, 
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Table  5.1 

GLF  approximation  to 

densities. 

GLF  Approximation 

Distribution 

^1 

^2 

X3 

^4 

Normal  (0,1) 

0 

.1914 

.1341 

.1341 

Shifted  Exponential 

-.993 

-.1081 

-.0407 

-.1075 

Logistic 

0 

-.0659 

-.0353 

-.0353 

Gamma  (2,2) 

-.782 

.0379 

.5503 

.0355 

Uniform  (-1,1) 

-1 

1 

1 

1 

Table  5.2   GLF  approximations  to  '|;  functions. 


ifj  Function 


OLS 
LAV 
A/(l  +  (A/5)^) 


GLF  Approximation 

^1 

^2 

^3 

^4 

0 

.1974 

.1341 

.1341 

0 

-.1586 

-.0802 

-.0802 

0 

.4917 

.4092 

.4092 
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Only  certain  values  of  X   will  produce  a  density  function  which  is 
greater  than  or  equal  to  zero  over  the  real  line,  and  integrates  to  one. 
The  acceptable  values  of  X  depend  only  on  X^   and  X^.  The  regions  of 
acceptable  values  are  shown  in  Figure  5.6.  The  formula  for  the  kth  non- 
central  moment  is  given  by  Ramberg  et  al .  (1979)  as 


k 


(5.15)     E(a'')  --^      I 


A^   i=0 


k 


.  r(x^(k-i)+i)  r(x,i+i) 

(-1)  — 

r(X3k  +  (x^-X3)i  +  2) 


when  -1/k  <  min(X3,X^)  all  positive  moments  of  order  less  than  or  equal 
to  k  will  exist.  For  this  reason,  the  contour  plots  of  skewness  and 
kurtosis  were  drawn  in  quadrant  I  of  the  {XyX^)   plane,  and  for  -.25  < 
X3,X4  <  0  in  quadrant  III  of  the  {X^,X^)   plane,  where  skewness  and  kurtosis 

exist. 

The  range  of  the  distribution  is  also  a  function  of  X.  Table  5.3 
summarizes  the  ranges.  In  quadrant  I  of  the  (X3,X^)  plane  (row  1  of  Table 
5.3),  GLF  distributions  have  finite  range  determined  by  X^  and  X2.  The 
normal  distribution  is  closely  approximated  by  a  GLF  distribution  in  quad- 
rant I,  with  X^  =  0  and  X2  =  .1941  (see  Table  5.1).  These  parameters 
determine  the  range  for  the  GLF  approximation  of  the  normal  distribution 
to  be  (-5.152,5.152).  The  normal  distribution  has  .9999996  of  its  prob- 
ability mass  within  this  interval.  Range  dependencies  on  X  are  given  in 

Table  5.3. 

Figure  5.6  and  Table  5.3  indicate  a  shortcoming  of  the  GLF  for  its  use 
with  maximum  likelihood  estimation.  To  maximize  Equation  (5.4)  over  B  and 
X  involves  maximizing  the  likelihood  over  the  regions  shown  in  Figure  5.6. 
No  restrictions  on  3  will  be  needed.   Restriction  of  X^  will  beas  shown  in 
Table  5.3.  Each  of  the  four  regions  in  Figure  5.6  could  be  used  individually 
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Figure  5.6  Moments  of  GLF  distributions, 
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Table  5.3   Range  of  GLF  distributions. 


^2 

^3 

A4 

>  0 

>  0 

>  0 

>  0 

=  0 

>  0 

>  0 

>  0 

=  0 

<  0 

<  0 

<  0 

<  0 

=  0 

<  0 

<  0 

<  0 

=  0 

Lower 
Limit 

Upper 
Limit 

A-,  -  l/X^ 

X^  +  l/X^ 

^1 

X^  +  VX^ 

x^  -  17X2 

^1 

+  00 


X,  +  °° 


^1 
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to  develop  maximum  likelihood  estimates.  The  overall  maximum  of  the  four 
regions  is  the  GLF  maximum  likelihood  estimator. 

Table  5.3  presents  an  additional  difficulty  in  that  the  range  of  the 
error  distribution  may  depend  on  the  parameter  values.  This  will  violate 
the  assumptions  sufficient  to  insure  that  the  maximum  likelihood  estimators 
B*  and  A*  will  be  asymptotically  normally  distributed  and  efficient.  In 
the  third  quadrant  of  the  (X^,A.)  plane,  the  range  is  (-co,+oo)  independent 
of  X. 

Maximum  Likelihood  Estimation 
The  use  of  maximum  likelihood  estimation  will  allow  general  results 
for  these  estimators  to  be  applied  to  B*  and  X*.  Let  9  =  (Bq^B-,  , . . .  ,B|^^, 
Xp,X-,,X«),  9  £  0,  the  parameter  space,  and  let  9*  be  the  maximum  likelihood 
estimate  from  Equation  (5.4).  Then  if 


(5.16) 


3  log  L 


39. 


-  0 


i  =  l,2,...,k+4 


(5.17) 


3'  log  L 


38.  39. 


3  log  L 

36. 
1 


3  log  L 

39. 
J 


i,j=l,2,...,k+4 


9*  will  be  asymptotically  normally  distributed  with  mean  0  and  variance 
covariance  matrix  whose  elements  are  given  by  (5.17).  9*  is  also  efficient 
(Kendall  and  Stuart,  1973). 

For  the  estimation  problem  of  (5.4),  these  regularity  conditions  hold 
for  X^  <  0,  X  <  0.  When  either  X^  or  X.  is  positive,  the  range  of  the 
distribution  is  a  function  of  X  and  such  a  relationship  will  violate  (5.16). 
In  the  third  quadrant  of  the  (X2,X^)  plane,  the  GLF  will  have  a  bounded 
continuous  density  which  is  twice  differentiable.  This  will  imply  (5.16) 
and  (5.17).  One  can  use  (5.17)  to  obtain  estimates  of  the  asymptotic 
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variance  covariance  matrix  elements  by  evaluating  3  log  L/36.  numerically 
as  needed  (Kennedy  and  Gentle,  1980). 

The  asymptotic  results  of  Chapter  2  do  not  apply  to  continuously 
adaptive  M-estimators,  since  the  ^   function  for  these  estimators  is  not 
fixed,  but  rather  depends  on  X. 

Evaluating  the  Likelihood 
Evaluating  Equation  (5.4)  in  a  given  regression  problem  is  not  straight- 
forward. For  a  specific  set  of  regression  data,  y.  and  x.  are  known.  The 
likelihood  function  depends  on  9  as  defined  previously.  At  a  given  9, 
calculation  of  the  likelihood  involves  several  steps. 

(5.18)      Evaluate     r^  =  y^.  -  x^S 


(5.19)  Evaluate     f.  =  f(r.;X) 

n 

(5.20)  L  -  n  f. 

i  =  l  ^ 


Calculation  of  f.  from  r.  and  X   can  be  done  using  three  different 
algorithms,  recalling  that  the  relationship  in  (5.19)  is  defined  parametri- 
cally  in  terms  of  a  percentile  p-  by  Equations  (5.2)  and  (5.3),  The  problem 
reduces  to  finding  the  root  of  a  nonlinear  function,  g(p.),  where 


(5.21)   g(p.)  =  p^3  .  (T  .  p.)A4  +  ^^^^^    .   ^ . ) 


We  wish  to  find  p.  so  that 


(5.22)  g(p.)  =  0 


subject  to 


(5.23)  0  <  p.  <  1 
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wh 


ere  A,  is  given  by  (5.14)  so  that  E(A)  =  0.  Once  p.  is  found,  (5.3)  can 


be  evaluated  for  f..  Functions  given  by  (5.21)  are  shown  in  the  Appendix 
for  various  combinations  of  X.     Without  loss  of  generality,  the  curves  are 
presented  for  X-.   and  r.  both  set  to  zero. 

A  straightforward  algorithm  for  computing  p.  so  that  g(p-)  =  0  given 
Xr,,X^,X.,   and  r.  is  presented  below 

Algorithm  5.1   (Solve  (5.22)  using  binary  search) 

(5.24)  Set  p,  to  0.  Set  p,  •  to  1. 
'  ^lo  hi 

(5.25)  Set  p  to  (p^Q  +  p^.)/2 

(5.26)  If  g(p)  >  0    then  set  p   to  p 
If  g(p)  =  0    then  stop 

If  g(p)  <  0    then  set  p  '  to  p 

(5.27)  Go  to  (5.25) 

This  algorithm  will  converge,  but  requires  many  time  consuming 
evaluations  of  (5.21).  A  faster,  but  less  reliable  algorithm  is  an 
adjusted  -  Newton  algorithm. 

Algorithm  5.2  (Solve  (5.22)  using  an  adjusted  Newton  method) 

(5.28)  Set  p  to  .5 

(5.29)  If  g(p)  ^  0    then  stop 


(5.30)       Set  p^g^  to  p  -  g(p)/g'(p) 


P    -  1 

(5.31)  If  P,ew  >  ^    then  set  p^^^  to  p+(l-p)  ^^^^ 

(5.32)  If  p_,.,  <  0   then  set  p_,„  to  p+p   ""^ 


new  -new       p-p^^^ 


(5.33)        Set  p  to  p^g^.  Goto  (5.29) 
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The  adjustments  in  (5.31)  and  (5.32)  are  necessary  to  meet  the 

conditions  of  (5.23).  The  Newton  step  in  (5.30)  may  produce  a  p^^^  outside 

the  permissable  range  of  values  for  the  solution.  If  p^^^  is  too  high, 

the  adjustment  in  (5.31)  moves  p^^^  close  to  one  if  the  Newton  value  for 

D    was  much  larger  than  one.   If  p„^,,  is  as  far  past  one  as  p  is  less 
^new  new 

than  one,  then  p    becomes  half  the  distance  from  p  to  one.  Proportional 
new 

adjustments  are  made  in  an  analogous  manner  in  (5.32)  when  p^^^  is  less 

than  zero. 

This  algorithm  is  faster  than  Algorithm  5.1  if  the  root  is  near  p=.5. 
As  the  root  moves  closer  to  0  or  1 ,  Algorithm  5.1  is  faster.  The  number 
of  evaluations  of  (5.29)  required  by  each  algorithm  under  various  conditions 
are  given  in  Table  5.4.  Without  loss  of  generality,  X^  is  considered  fixed 
at  0.1.  While  Algorithm  5.2  is  often  faster  than  Algorithm  5.1,  its 
advantage  diminishes  whenever  the  root  is  near  the  boundary.  As  a  general 
algorithm,  average  time  over  the  range  of  possible  roots  is  lower  with 

Algorithm  5.1 . 

For  calculation  of  the  likelihood  at  a  particular  value  of  6  and  X, 
(5.21)  must  be  solved  repeatedly  for  each  point  (y^-,x.)  in  the  regression 
data.  This  process  can  be  greatly  abbreviated  by  noting  that  when  6  and  A 
are  fixed,  the  value  of  g(p.)  depends  only  on  r . ,  and  r.  effects  g(p.)  in 
a  simple  linear  manner.  This  suggests  that  if  an  approximation  to  g(p^.) 
can  be  found  which  can  easily  be  solved  for  a  particular  r . ,  then  altering 
the  approximation  for  a  subsequent  r.  can  be  achieved  by  a  simple  linear 
translation.  The  highest  degree  polynomial  which  can  be  solved  using 
explicit  formulas  is  a  quartic.  Examinations  of  g(p)  for  various  X,   and 
initial  approximations  involving  cubic  polynomials  indicate  that  a  cubic 
polynomial  approximation  is  sufficient.  Comparisons  of  improved  accuracy 
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Table  5.4   Comparison  of  Algorithms  5.1  and  5.2 

Evaluations  of  (5.21) 


r 

^3 

^4 

A5.1 

A5.2 

Root 

1 

.01 

.01 

15 

10 

.99997 

3 

.01 

.13 

10 

7 

.9816 

-2 

.01 

.40 

7 

4' 

.2109 

0 

.13 

.01 

5 

3 

.40625 

9.9 

.13 

.13 

15 

>  100 

.99997 

-.5 

.13 

.40 

8 

3 

.4570 

2 

.40 

.01 

7 

4 

.7891 

1.3 

.40 

.13 

9 

3 

.6387 

-2.1 

.40 

.40 

9 

3 

.33008 
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of  a  quartic  approximation  versus  the  faster  cubic  approximation  are  a 
subject  for  future  research. 

Suppose  h(p)  is  a  cubic  polynomial  which  closely  approximates  g(p)  for 
a  particular  r.,  that  is 

(5.34)      g(p)  ==  h(p)  =  ag  +  a^p  +  a^p'  +  a3P^ 


Let 


(5.35)  g.(p)  =  p^3  _  (T  .  p)A4  ^  ^^^^^  _  ^_^ 
then 

(5.36)  g.(p)  -  g^(p)  =  \^{r.   -   r.) 

So  if  h^(p)  ==  g^.(p),  then 

(5.37)  g^.(p)  ::  hj.(p)  =  a^  -  X^ir.   -  i^.,- )  +  a^p  +  a^p^  +  a3P^ 


that  is,  only  the  constant  term  in  the  cubic  approximation  is  to  be  altered. 
This  effect  is  shown  for  a  particular  g.  and  g.  in  Figure  5.7.  While  this 
simple  linear  translation  is  a  trivial  relationship  between  g.  and  g.,  it 
does  not  suggest  a  simple  relationship  between  the  root  of  g.  and  the  root 
of  g..  If  h.  and  h.  are  suitable  cubic  approximations,  then  the  roots  of 
h.  and  h.  are  easily  found  and  will  approximate  the  roots  of  g.  and  g. 
respectively.  Thus,  we  can,  without  loss  of  generality,  find  a  cubic 
approximation  to  g(p)  for  r.  =  0.  Call  this  function  to  be  approximated  g„, 
that  is 


(5.38)  '-   gQ(p)  =  p^3  _  (1  .  p)M 


Two  choices  for  finding  approximating   cubic  polynomials  have  been 
considered.  The  first  is  a  Taylor  series  of  g„  expanded  around  p  =  1/2, 
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which  is  the  center  of  the  neighborhood  in  which  gn(p)  is  to  be  approxi- 


mated. The  derivation  of  the  cubic  Taylor  series  is  given  below 

^0 


(5.39)      gQ(p)  ^  gQ(l/2)  +  gQ(l/2)(p  -  1/2)  + 


g^'(l/2)(p  -  1/2)' 


gQ'-(l/2)(p  -  1/2)^ 


(5.40)      gQ(l/2)  =  (1/2)^3  _  ^y2)M   +  ^^^2 


(5.41)      gQ(p)  =  X3pV^  +  A^d  -  p)^4-l 


(5.42)      g-(p)  =  XJX^  -   1)  p^3-2  +  X.(X.-1)(1  -  p)V2 


^3^  3 


4^  4 


(5.43)      g;:^(p)  =   A,(X,-l)(A,-2)  p^3  "  3  +  X,(A,  -  1  )(X, -2)  (1  -  p)^4  -  3 


'3^  3 


4^'^4 


(5.44)      (p  -  1/2)'  =  p'  -  p  +  1/4 


(5.45)       (p  -  1/2)3  =  p3  -  (3/2)p'  +  (3/4)p  -  (1/8) 


Let 


g^°^  =  gn(V2)  ,  g^^)  =  g;;(l/2)  ,  g^2)  =  gni/2)  ,  g^^^  =  9^0/2), 


(2) 
(5.46)   gQ(p)  =  g^°)  +  g^^)(p  -  1/2)  +   5__  [p^  .  p  +  (1/4)] 


(3) 

+  ^  [p'  -  (3/2)p'  +  (3/4)p  -  (1/8)] 


(5.47)   gQ(p)  :  p^ 


6 
,(3) 


+  P' 


,(2) 


1  J3) 


+  P 


'      2     8 


,(0)  -la(l).ia(2)  _4  0^3) 
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The  second  method  of  choosing  an  approximating  cubic  polynomial  for  gn(p) 
is  to  evaluate  gn(p)  at  some  points  throughout  p  e   [0,1],  and  then  estimate 
the  coefficients  of  the  polynomial  using  least  squares.  Least  squares  was 
chosen  on  the  basis  of  ease  of  computation.  Enough  points  must  be  chosen 
to  insure  a  good  fit,  but  not  so  many  as  to  slow  the  estimation  of  the 
parameters.  After  trial  and  error,  seven  points  were  chosen,  which  under 
a  variety  of  conditions,  produced  good  approximating  cubic  polynomials. 
The  choice  of  the  seven  points  involved  several  considerations. 

1.  p  =  1/2  must  be  included  to  get  a  good  approximation  in  the  center 
of  the  range. 

2.  The  seven  points  must  be  symmetric  about  p  -   1/2  since  no  a  priori 
knowledge  is  available  about  Pr(r  >  0). 

3.  The  seven  have  points  close  to  zero  and  one  to  get  a  good  approxi- 
mation in  the  tail  of  the  error  density. 

The  seven  points  chosen  to  form  the  design  are  given  as  vector  p  below. 


(5.48) 


.001 

.10 

.25 

.50 

.75 

.90 

.999 


This  produces  the  P  matrix,  P"P,  and  (P'P) 


(5.49) 


P  = 


1     .001 

10  " 

10  ' 

1     .10 

10"^ 

10-^ 

1     .25 

.0625 

.015625 

1     .50 

.25 

.125 

1     .75 

.5625 

.421875 

1     .90 

.81 

.729 

1     .999 

.998001 

.997003 
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(5.50) 


P'P 


3.5 

2.693 

2.2895 

2.693 

2.2895 

2.03502 

2.03502 

1.85504 
1.7193 

(5.51) 


(P^P)"'  = 


.83901 

-.63919 

12.4463 

-6.9246 

95.4604 

-224.375 

137.574 

575.076 

-371.475 
247.65 

This  matrix  is  constant  so  that  to  evaluate  the  likelihood,  the  steps 
below  can  be  used. 


Algorithm  5.3   (Calculation  of  Likelihood) 
(5.52) 

(5.53) 


Evaluate  gn(p)  for  each  point  in  P.  Call  result  g. 


Estimate  the  coefficients  of  hp^(p)  using  (5.57)  or  (5.47), 

(5.54)  For  each  r.,  evaluate  p.  so  that  h.(p.)  =  0. 

(5.55)  For  p.,  calculate  f.  using  (5.3) 

n 

(5.56)  L  =  n   f.. 

i=l   ^ 


Let  a  be  the  coefficients  of  f\)(p) .     Then  step  (5.53)  can  be  accomplished 
via  least  squares  as 

(5.57)    a  =  (P'P)"'  P^g 

Since  (P"P)~  and  P"  are  fixed,  finding  an  approximating  cubic  polynomial 
can  be  done  by  a  simple  linear  transformation  of  the  vector  g. 

The  solution  of  h.  for  p.  such  that  h.(p.)  =  0  can  be  done  explicitly 
using  Cardan's  formulas  from  Selby  (1971).   If 
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(5.58) 


^      ^3   ^3     ^3 


Rewrite  h.(p)  as 


(5.59) 


where 

(5.60) 

(5.61) 

(5.62) 

Further,  let 
(5.63) 

(5.64) 


h.(p)  =  bn  +  b^q  +  q- 


q  - 


^0  ^27 


''l  =3 


3a. 


A  = 


B  = 


a^a-      ag 

9  -^-^    +  27  -^ 

a^       ^3 


^3 


4     27 


'A  .  A 

4     27 


The  only  real  root  of  the  cubic  equation  (5.59)  is 


(5.65) 
(5.66) 


q  =  A  +  B,  if 
T  *27  >  ° 


If  (5.66)  is  not  met,  then  either  two  real  roots  exist,  or  three 
imaginary  roots  exist.  If  either  of  these  conditions  exist,  the  algorithms 
under  consideration  will  return  0  for  f^.  to  indicate  that  ^2''*'3'  ^'^'^  '^4 
do  not  admit  a  solution  with  positive  likelihood  for  the  particular  residual 
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Cardan's  technique  can  be  used  with  a  Taylor  series  cubic  polynomial, 
or  with  a  fitted  cubic  polynomial.  In  either  case,  only  a^   depends  on  r^. . 
For  fixed  6  and  X,  h.  varies  only  in  a^.  Calculation  of  terms  in  (5.63) 
and  (5.64)  is  reduced  by  recalculating  only  those  factors  which  change 

when  r.  changes. 

In  Table  5.5  the  seven  point  cubic  approximation  method  is  compared 
to  the  binary  search  method  of  Algorithm  5.1.  Data  is  from  Example  7, 
later  in  this  chapter.  Timings  are  real  elapsed  clock  time  in  seconds. 
See  Appendix  A  for  a  description  of  the  computing  facilities  used. 

As  can  be  seen  from  Table  5.5,  the  approximate  method  is  five  to  fifteen 
times  faster  than  the  exact  method.  In  the  rest  of  this  work,  the  seven 
point  cubic  approximation  is  used  to  estimate  the  likelihood. 

In  terms  of  closeness  of  approximation  to  the  exact  method,  the  seven 
point  cubic  approximation  is  far  superior  to  the  third  order  Taylor  series 
expansion.  Figure  5.8  shows  both  a  Taylor  series  cubic  polynomial  approxi- 
mation to  a  g  function  for  the  normal  distribution  as  approximated  by 
GLF(0,. 1974, .1341, .1341)  for  a  residual  of  1.645.  The  true  root  is,  there- 
fore, at  p  =  .05.  For  this  example,  the  Taylor  series  fails  to  give  a 
root  in  the  interval  [0,1], 

Solving  for  Maximum  Likelihood 

In  the  preceding  section  a  seven  point  cubic  approximation  algorithm 
was  presented  which  can  quickly  evaluate  (5.4)  for  a  given  point  6  in  the 
parameter  space.  In  this  section  methods  will  be  presented  for  finding  9 
so  that  (5.4)  is  maximized. 

Solution  of  (5.4)  using  numerical  methods  can  be  accomplished  using 
gradient  techniques,  such  as  Fletcher  and  Powell  (1963),  or  derivative-free 
techniques.  Closed  form  expressions  for  the  partial  derivatives  of  L  with 


80 


Table  5.5   Calculating  the  likelihood. 


Seven 

Point 

Algori 

thm  5.1 

X2 

^3 

h 

-In  L 

Time 

-In  L 

Time 

.0029 

.0710 

.8841 

592.86 

.1094 

592.80 

.6484 

.0069 

.7453 

.6952 

553.02 

.1094 

553.01 

.7109 

.0112 

.8204 

.7164 

509.51 

.1328 

509.44 

.6563 

.0076 

.3124 

.4318 

504.25 

.1094 

504.32 

.5625 

.0044 

.8086 

.1596 

562.83 

.1250 

562.65 

.6250 

.0032 

.3858 

.7309 

613.55 

.1094 

613.28 

.9219 

.0047 

.8214 

.3814 

583.57 

.1406 

583.20 

.7188 

.0031 

.5720 

.0820 

568.92 

.1250 

568.99 

1.78125 

.0087 

.3031 

.4645 

493.23 

.1250 

492.75 

.7891 
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respect  to  each  parameter  do  not  exist.  Numerical  derivatives  could  be 
used,  but  investigation  of  likelihood  contours  indicated  that  a  gradient 
technique  may  have  severe  difficulty  with  the  number  of  turns  in  the 
surface.  Derivative-free  techniques  are,  in  general,  slower  than  gradient 
techniques,  but  are  more  reliable  when  the  function  being  optimized  is 
not  smooth  (Chambers,  1979). 

Derivative-free  techniques  can  be  classified  as  surface  following  or 
direct  search.  A  surface  following  algorithm  uses  a  series  of  function 
evaluations  to  move  along  the  surface  of  the  function  being  optimized.  An 
example  of  a  surface  following  algorithm  is  the  Simplex  algorithm  of 
Nelder  and  Mead  (1964).  While  not  using  derivatives,  a  surface  following 
algorithm  is  subject  to  a  similar  disadvantage--susceptibility  to  local 
minima.  A  direct  search  algorithm  involves  evaluating  the  function  at  a 
continuingly  evolving  set  of  parameter  points  throughout  the  parameter 
space.  A  simple  algorithm  might  involve  a  Euclidean  grid  over  the  parameter 
space  and  the  evaluation  of  the  function  at  each  point  on  the  grid.  Such 
a  technique  is  not  susceptible  to  local  minima,  but  has  difficulty  with 
large  changes  in  function  values  which  occur  between  grid  points. 

A  direct  search  technique  suitable  for  arbitrary  functions  is  the 
controlled  random  search  technique  (Price,  1977).  The  algorithm  is 
presented  below. 

Algorithm  5.4  (Controlled  Random  Search) 

(5.66)  Select  k  points  9  from  0.  Call  them  S. 

(5.67)  Select  j  points  from  S.  Call  the  centroid  e". 

(5.68)  Select  1  point  from  S.  Call  it  Q^^K     Calculate  a  new  point  via 


(5.69)  9  =  2-9-  9^P^   If  9  ?!  0,  goto  (5.67) 
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(5.70)    Evaluate  L(9).  If  L(0)  >  ^^   L(y),  then  replace  y  with 
Otherwise  goto  (5.57). 


(5.71)    If 


min  ,  tn\         niax 
eeS 


L(9)  -  "^11     L(9) 


<  e   stop. 


Otherwise  goto  (5.57). 


The  number  of  points  in  S  (called  the  storage),  is  arbitrary.  Values 
of  20- j,  where  j  is  the  dimension  of  9,  have  worked  well  in  practice. 
This  algorithm  has  been  applied  to  a  wide  range  of  optimization  problems 
(Khuri  and  Myers,  1981;  Khuri  and  Conlon,  1981)  and  is  well-suited  for 
problems  with  difficult  boundary  conditions  and  multiple  optima. 

For  solving  the  maximum  likelihood  problem.  Price's  (1977)  procedure 
is  quite  slow,  requiring  many  evaluations  of  L(9),  but  the  procedure 
invariably  converges.  Simplex  algorithms  were  implemented  and  tested,  but 
could  not  handle  the  lack  of  smoothness  in  the  surface  of  L(9). 

An  additional  advantage  of  Price's  procedure  over  other  optimization 
algorithms  is  its  lack  of  a  single  starting  point.  Choosing  a  starting 
point  for  a  gradient  technique  or  a  Simplex  algorithm  may  predetermine 
the  result  if  local  minima  exist.  Andrews  (1974)  developed  a  median 
regression  estimator  to  use  as  a  starting  point  for  iteratively  reweighted 
least  squares  in  finding  estimates  using  a  SINE  ^   function.  Holland  and 
Welsch  (1977)  recommend  using  a  LAV  estimate  as  a  starting  point.  Birch 
(1980a)  reviews  the  effect  of  starting  points  on  IRWLS.  Price's  procedure 
requires  only  that  the  parameter  space  0  be  defined  so  that  (5.56)  can 
be  accomplished,  and  the  test  of  (5.69)  can  be  performed.  In  most  cases, 
0  can  be  a  j-dimensional  rectangle.  The  size  of  the  rectangle  will  control 
the  rate  of  convergence. 
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In  continuously  adaptive  M-estimation  using  the  GLF  as  an  error 
family,  0  need  only  be  defined  so  that  the  true  values  of  B  and  X  are 
sure  to  lie  within  0.  This  can  be  done  by  using  initial  least  squares 
estimates  for  6,  and  centering  the  limits  around  B  using  a  very  small 
confidence  coefficient,  a.  Even  under  serious  departures  from  normality, 
if  a  is  chosen  small  enough  we  can  be  reasonably  sure  that  the  true  B  lies 
in  the  confidence  interval.  For  X,  constant  bounds  given  below  will  suffice. 

(5.72)  1/lOa    <     X2     <     10a 

(5.73)  0       <    XyX^  <  1 

The  assumption  of  (5.72)  is  that  the  least  squares  estimate  of  the 
standard  deviation  of  the  error  process,  a,  is  within  a  factor  of  ten  of 
being  correct.  The  assumptions  that  X^  and  X.  are  less  than  one  serve 
only  to  restrict  fitting  to  densities  which  are  not  U-shaped.  If  during 
fitting,  the  estimated  values  of  the  parameters  are  close  to  a  boundary 
of  0,  the  boundary  could  be  relaxed.  The  bounds  in  (5.72)  and  (5.73) 
are  given  purely  for  computational  convenience.  The  true  bounds  ^re 
given  for  X  by  Figure  5.6,  and  for  B,  the  value  may  lie  anywhere  within 

A  listing  of  a  FORTRAN  program  to  perform  the  continuously  adaptive 
M-estimation  fitting  of  arbitrary  linear  models  using  GLF  densities,  a 
seven  point  cubic  approximation  to  gQ(p),  and  Price's  procedure  is  given 
in  the  Appendix. 

The  algorithms  of  this  section  can  easily  be  extended  to  cover  non- 
linear models.  One  can  write 


(5.74)  r.  =  y.  -  m(x.,B) 
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and  replace  (5.18)  in  the  estimation  of  the  likelihood.  The  remainder  of 
the  algorithms  are  unchanged. 

Examples 
Continuously  adaptive  M-estimation  is  demonstrated  in  the  examples 
which  follow.  Problems  involving  experimental  data  and  problems  involving 
generated  data  are  shown.  For  each  problem,  the  fit  is  given  along  with 
a  comparison  to  least  squares  estimates.  Data  for  each  example  are  given 
in  the  Appendix.  Symmetric  and  asymmetric  error  distributions  are  used 
in  the  generated  data  problems.  Both  designed  and  undesigned  situations 
are  considered.  The  examples  are  summarized  in  Table  5.6. 

Because  of  the  large  amount  of  computer  time  required  to  calculate 
a  continuously  adaptive  M-estimate  (CAM),  no  simulations  have  been  done 
to  date.  The  examples  used  in  this  chapter  were  chosen  to  demonstrate 
the  ability  of  CAM  to  perform  under  a  wide  variety  of  modeling  situations. 
For  each  example,  a  description  of  the  source  of  the  data  is  given. 
Estimates  of  the  regression  parameters  are  given  for  OLS  fits  and  for  CAM 
fits.  In  addition,  for  CAM  fits,  the  estimates  of  x   are  given  with  a  plot 
of  the  corresponding  error  density.  Since  CAM  admits  asymmetric  solutions, 
no  closed  form  expressions  for  standard  errors  of  the  regression  parameters 
from  the  literature  of  M-estimation  apply.  Bootstrap  estimates  (Carrol,  - 
1979)  would,  for  CAM,  be  far  too  costly.  Standard  errors  for  the  CAM 
estimates  remain  unknown. 

The  first  example  consists  of  fifty  points,  each  representing  a 
country.  Four  variables  measuring  economic  attributes  are  used  in  a  first 
order  multiple  linear  regression  model  to  predict  the  country's  average 
aggregate  personal  savings  rate.  The  fits  are  given  in  Table  5.7.  For 
the  least  squares  fit,  the  regression  parameter  estimates  and  their  standard 
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Table  5.6   Contents  of  examples. 


Example 


Source 


Design 


Model 


Error 


1 

Belsley 

(1980) 

none 

2 

Brownlee 

(1965) 

none 

3 

Generated 

2^ 

4 

II 

22 

5 

M 

3^ 

6 

II 

3^ 

7 

II 

none 

8 

II 

none 

9 

" 

none 

10 

It 

none 

mult,  linear  unknown 

mult,  linear  unknown 

ANOVA  sym  GLF 

ANOVA  asym  GLF 

full  quadratic  sym  GLF 

full  quadratic  asym  GLF 

simple  linear  slash 

simple  linear  contaminated  normal 

simple  linear  bimodal 

nonlinear  normal 


87 


Table  5.7   OLS,  CAM  for  Example  1. 

OLS  CAM 

3  s.e.               B 

28.5663  7.3545 

-.4612  .1446 

-1.6914  1.0836 

.0003  .0009 

.4097  .1962 


24.589 

.47358 

-.36548 

.10018 

-1.7135 

.49368 

-.003698 

.  39483 

.55561 

4000  iterations 

-   In  L  = 

138. 

,2 
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errors  are  given.  For  continuously  adaptive  M-estimation  (CAM),  the 
regression  parameter  estimates  are  given,  as  well  as  estimates  of  the 
parameters  of  the  error  density.  The  graph  of  the  estimated  error  density 
corresponding  to  these  parameters  is  shown  in  Figure  5.9. 

The  iteration  count  given  in  Table  5.7  is  the  number  of  times  a  new 
parameter  point  succeeded  in  producing  a  negative  log  likelihood  smaller 
than  the  previously  largest  held  value.  The  convergence  criteria  for  CAM 
in  all  examples  was 

(5.75)  LM  -  LN     <    ^1 

LM  +  LN        •^' 

where 

(5.76)  LM  =  ^11     -In  L(e)   ,   LN  =  ^^^  -In  L(9) 

The  OLS  and  CAM  regression  parameter  estimates  in  this  example  are 
quite  close  to  each  other,  indicating  that  CAM  can  give  estimates  similar 
to  OLS  when  the  data  are  symmetric  and  mound-shaped. 

Figure  5.9  indicates  that  little  information  is  available  about  the 
tails  of  the  error  distribution.  There  are  no  turning  points  in  the 
estimated  density.  In  this  example,  (A^.X.)  =  (.49368, .39483)  which  gives 
a  density  with  a  skewness  close  to  zero,  and  a  kurtosis  close  to  2.2, 
according  to  tables  in  Ramberg  et  al.  (1979).  CAM  estimates  the  true 
error  in  this  example  to  be  lighter  tailed  than  the  normal  distribution. 

Example  2  is  from  Brownlee  (1965),  and  was  analyzed  by  Andrews  (1974). 
Twenty-one  points  on  three  variables  are  analyzed  using  a  first  order 
multiple  linear  regression  model.  Table  5.8  gives  the  results  of  the  OLS 
and  CAM  fits. 

The  graph  of  the  error  density  corresponding  to  the  A  parameter 
estimates  is  given  in  Figure  5.10.  This  density  is  very   close  to  a  uniform 
distribution  on  [-6,6].  This  is  quite  reasonable  given  the  lack  of  data. 
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Table  5.8   OLS,  CAM  for  Example  2 


OLS 


CAM 


s.e. 


39.9197 

11.896 

.7156 

.1349 

1.293 

.3680 

-.1521 

.1563 

-37.225 

-.080009 

.55336 

.20396 

1.8710 

.92584 

-.20964 

.98833 

5673  iterations 

-   In  L  =  48.06 
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With  only  twenty-one  points,  a  good  error  density  estimate  is  not  avail- 
able. The  estimated  density  is  uniform,  indicating  ignorance  about  shape. 
Example  3  demonstrates  the  performance  of  CAM  when  the  error  is  truly 
normal.  The  data  consist  of  eighty  points,  twenty  points  in  each  of  four 
cells  of  a  2x2  layout.  The  model  used  was 


(5.77)        ^ijk^^^^i  -^Sj-^S-jk 


i  =  1,2    j  =  1,2    k  =  1,...,20 


where 


2  2 

(5.78)  I  a.  =  0      ,     I  3.  =  0 

i=l  '  j=l  J 

The  true  values  of  y,  a-,,  B-.  were  4,  2,  and  -1  respectively  and  e^ -j^  -  iid 
n(0,l).  The  results  of  the  fit  are  given  in  Table  5.9.  Figure  5.11  shows 
the  graph  of  the  estimated  density  compared  to  the  graph  of  the  true 
normal  density.  With  eighty  points  CAM  produces  a  reasonable  replication 
of  the  true  error.  Little  is  lost  in  terms  of  squared  residuals  for  this 
fit.  Least  squares,  by  definition,  must  produce  a  sum  of  squared  residuals 
less  than  any  other  method.  For  Example  3,  this  sum  was  78.83.  For  CAM, 
the  sum  of  the  residuals  was  3.74  and  the  sum  of  squares  was  79.77.  The 
parameter  estimates  of  CAM  are  not  as  close  to  the  true  values  as  the  OLS 
estimates.  Of  the  eight  examples  for  which  true  values  are  known,  this 
is  one  of  two  ^  for  which  CAM  estimates  were  further  from  the  true 
values  than  the  least  squares  estimates.  The  sum  of  squared  differences 
from  the  true  values  to  the  estimates  was  .0373  for  OLS,  and  .0838  for  CAM. 
The  sum  of  absolute  differences  was  .2837  for  OLS,  and  .413  for  CAM.  This 
is  actually  quite  close  agreement. 
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Table  5.9   OLS,  CAM  for  Example  3 


OLS 


CAM 


s.e. 


4.0852 

.1959 

1.9727 

.2253 

1.1712 

.2263 

3.9854 

-.454 

2.1530 

.1019 

-1.2454 

.0379 
.0905 

2644  iterations 

-  In  L  = 

110. 

.9 
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Figure  5.12  shows  the  residuals  of  CAM  and  OLS  for  Example  3.  Again 
there  is  little  difference  between  the  methods  when  the  underlying  error 
is  normal.  These  residual  histograms  indicate  the  problems  with  examining 
such  data  to  determine  the  shape  of  the  error  process.  The  least  squares 
residuals  do  not  indicate  a  normal  distribution. 

Example  4  uses  the  same  design  and  model  as  Example  3.  The  error  for 
Example  4  is  close  to  uniform,  but  skewed  to  the  right.  It  was  generated 
from  a  GLF  with  X  =  (4. 396, . 1 , .95, .05).  Table  5.10  gives  the  results  of 
the  fits.  The  true  values  of  y,  a-,,  and  B-t  were  5,  2,  and  1,  respectively. 
Thirty  observations  were  in  each  cell.  The  sum  of  absolute  deviations  of 
the  e  estimates  to  the  true  values  was  1.081  for  OLS  and  .3266  for  CAM. 
The  sum  of  squared  deviations  was  .5048  for  OLS  and  only  .0673  for  CAM. 
The  sum  of  squared  OLS  residuals  was  1147.9  and  for  CAM  was  1478.6.  The 
sum  of  CAM  residuals  was  25.24.  For  these  data,  CAM  has  identified  the 
expected  value  of  the  dependent  variable  much  more  closely  than  has  OLS. 
The  positive  skewness  results  in  more  positive  true  error  values.  CAM 
reflects  this  in  its  residuals  which  sum  to  a  positive  value.  OLS  residuals 
must  sum  to  zero,  so  in  the  presence  of  positively  skewed  error,  OLS  will 
overestimate  the  expected  value  of  the  dependent  variable.  The  residual 
histograms  in  Figure  5.14  fail  to  clearly  reflect  the  asymmetry.  Figure 
5.13  shows  the  graphs  of  the  true  density  and  the  CAM  estimate. 

Example  5  demonstrates  the  use  of  CAM  on  a  full  quadratic  model.  The 
true  model  is 

(5.79)    y.  =  1  +  .2x^.  +  .4x2-  -  .5x^.  +  .05x2.  .  ,02x2.  +  t. 

where  e.  -  GLF(0,  .3,  .2,  .2)  and  the  design  is  a  3^  factorial  with  10  repli- 
cations. Table  5.11  gives  the  results  of  the  fit.  Figure  5.15  shows  the 
graphs  of  the  estimated  and  true  error  densities. 
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Table  5.10   OLS,  CAM  fits  for  Example  4 


OLS 


3 

5.405 

1.425 

.899 


s.e. 

.4953 
.5719 
.5719 


CAM 


5.0135 

1.7482 

.9387 


X 

2.9819 
.1230 
.98778 
.14925 


2529  iterations 
-  In  L  =  293.6 
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Table  5.11   OLS,  CAM  for  Example  5 

OLS  CAM 

6  s.e.  g 

.6936  .1913 

.1690  .1048 

.1538  .1048 

.7068  .1283 

.1194  .1815 

.3942  .1815 


7829 

.3348 

1719 

.1019 

0250 

.0703 

6679 

.0326 

1624 

2572 
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The  error  here  is  symmetric,  so  OLS  might  perform  well.  The  sum  of 
absolute  deviations  of  the  parameters  from  the  OLS  estimates  is  1.3737, 
and  for  CAM  is  1.2779.  The  sums  of  squared  deviations  are  .3984  and  .3388 
for  OLS  and  CAM,  respectively.  CAM  is  capable  of  estimating  several 
parameters  in  the  regression  and  can  outperform  OLS  even  when  the  error  is 
symmetric. 

Example  6  involves  the  same  design  and  the  same  model  as  Example  5. 
The  error  has  smaller  variance  and  is  asymmetric,  skewed  to  the  left.  It 
is  a  GLF  (-.3472, .8, . 2, .8) .  The  results  of  the  full  quadratic  fit  using 
OLS  and  CAM  are  given  in  Table  5.12. 

As  in  Example  4,  when  the  error  is  asymmetric  CAM  produces  an  estimate 
vector  much  closer  to  the  true  vector  whether  measured  by  sum  of  squared 
deviations,  or  sum  of  absolute  deviations.  Here,  the  OLS  squared  deviation 
sums  to  .10416  and  for  CAM,  only  .0036.  The  error  density  estimate  indicates 
a  distribution  skewed  to  the  left  and  concentrated  on  [-1,1].  Figure  5.16 
confirms  that  CAM  produces  a  reasonable  approximation  to  the  shape  of  the 
error  density. 

Example  7  is  a  simple  linear  regression  on  100  points,  where 

(5.80)  y^  =  1  -  .2x.  +  A. 

X.  are  iid  U(0,1)  and  A.  have  a  slash  distribution,  formed  by  taking  a 
normal  (0,1)  random  variate  and  dividing  it  by  a  uniform  (0,1).  This  error 
has  a  wery   high  variance  and  is  popular  in  simulation  studies  (Gross,  1976; 
Andrews  et  al.,  1972).  This  example  was  included  to  try  CAM  on  a  known 
error  of  high  variation,  not  in  the  GLF.  The  resulting  fits  are  given  in 
Table  5.13.  The  data  points  are  plotted  with  the  OLS  and  CAM  regression 
lines  in  Figure  5.17.  The  graph  of  the  error  density  estimate  is  given  in 
Figure  5.18 
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Table  5.12  OLS,  CAM  for  Example  6 


OLS 


CAM 


6 


s.e. 


,9354 
,2091 
,4442 
,4238 
.3090 
.1783 


,1260 
,0690 
,0690 
,0845 
.1195 
.1195 


.97328 

-.35527 

.15920 

.82139 

.38319 

.24942 

-.47745 

.96634 

.045272 

-.040653 

39  iterations 

In  L  =  58.35 
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Table  5.13   OLS,  CAM  for  Example  7 


OLS 


CAM 


3 

1.3810 
2.0691 


s.e. 

2.7580 
4.7780 


49554 

1.7117 

56511 

.0044930 

.017080 

.009186 

2283  iterations 
-  In  L  =  374.1 
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The  slash  distribution  can  produce  yery   large  outliers  as  seen  in 
Figure  5.17.  Three  very   large  and  tv/o  small  outliers  are  present  in  the 
data.  CAM  again  produces  a  line  closer  to  the  true  line  than  OLS.  In 
squared  deviation  CAM  is  superior,  .38778  to  5.294.  OLS  is  well  known 
to  be  highly  sensitive  to  outliers.  The  high  outlier  at  x  =  .91  pulls 
the  OLS  slope  estimate  up.  The  CAM  slope  estimate  is  much  closer  to  the 
true  value.  The  X   estimate  given  by  CAM  indicates  a  distribution  of 
properties  similar  to  those  of  a  slash  distribution.  The  small  ^,  and  X. 
indicate  that  outliers  are  rare.  The  low  X_  value  indicates  a  very   high 
variance.  The  combination  of  low  X^,X.   and  low  A^  indicates  a  distribution 
with  unusual,  but  very   extreme  outliers. 

The  extra  information  provided  by  CAM  does  not  seriously  degrade  the 
performance  as  measured  by  the  sum  of  squares  of  the  residuals  (SSE).  For 
OLS,  SSE  is  19371  and  for  CAM  it  is  19915.  CAM  residuals  sum  to  220.2 
indicating  that  CAM  has  ignored  some  high  data  points. 

Example  8  is  a  simple  linear  regression  on  200  points  with  a  contami- 
nated normal  error  distribution.  The  independent  variable  is  distributed 
uniform  (0,1),  and  the  error  is  normal  (0,1)  with  probability  .95  and 
normal  (0,10)  with  probability  .05.  The  true  regression  parameters  are 
2  and  1.2.  The  resulting  fit  is  shown  in  Table  5.14.  CAM  has  superior 
absolute  deviation  from  the  true  parameter  vector,  .0685  to  .1115,  and 
superior  squared  deviation  .00455  to  .0101.  The  SSE  for  OLS  is  227.7  and 
for  CAM  is  228.8.  CAM  again  outperforms  least  squares  even  when  the  data 
are  symmetric  and,  in  this  case,  predominantly  normal.  The  data  have  no 
rare  outliers  as  can  be  seen  in  Figure  5.19.  The  X   estimate  indicates  an 
error  distribution  of  moderate  variance,  symmetry  and  no  rare  outliers. 
This  describes  the  situation  in  Figure  5.19  quite  closely.  The  graph  of 
the  estimated  density  is  shown  in  Figure  5.20. 
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Table  5.14   OLS,  CAM  for  Example  8 


OLS 


CAM 


3  s.e. 

2.100  .1499 

1.2115  .2613 


2.0002 

.037713 

1.2683 

.087266 
.059402 
.055721 

2030  i 

terations 
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Example  9  has  100  points  with  a  bimodal  error  distribution.  The 
example  was  included  to  test  CAM  in  a  situation  where  the  GLF  distributions 
could  not  approximate  the  true  error.  Fifty  points  were  generated  from 
normal  (-1,1)  and  fifty  points  were  generated  from  normal  (1,1).  A  simple 
linear  model  with  3q  =  2  and  B-,  =  .1  was  used.  The  independent  variable 
was  distributed  uniform  (0,1).  Table  5.15  gives  the  results  of  the  fit. 
In  this  case,  the  least  squares  estimates  are  closer  to  the  true  estimates, 
.1943  to  .348  in  absolute  deviation,  .0210  to  .0667  in  squared  deviation. 
SSE  for  OLS  is  218.3.  SSE  for  CAM  is  218.5.  The  bimodal  nature  of  the 
error  process  can  be  seen  in  Figure  5.21.  CAM  is  inappropriate  in  this 
case,  as  is  least  squares.  The  X   estimate  fails  to  indicate  a  problem 
with  the  fit.  Only  graphical  methods  can  detect  the  inappropriateness 
in  this  case.  The  OLS  and  CAM  residuals  are  shown  in  Figure  5.22.  The 
estimated  error  density  is  shown  in  Figure  5.23. 

Example  10  uses  CAM  to  fit  a  nonlinear  model.  While  the  current 
research  is  primarily  concerned  with  linear  models,  the  procedures  and 
algorithms  developed  for  CAM  and  described  in  this  chapter  are  easily 
extendable  to  nonlinear  models.  The  model  fit  in  this  example  is  given 
below. 

(5.81)    y.  =  20  +  e"-°^^i  +  e.  i=l,...,100 

where  e.  ~  normal  (0,.l)  and  x.  =  1 ,2,3, . . . ,100.  Both  CAM  and  nonlinear 
least  squares  gave  very   good  results.  The  nonlinear  least  squares  was 
given  the  true  parameter  values  as  a  starting  point  for  the  fit.  See 
the  Appendix  for  details  of  this  computation.  Table  5.16  gives  the  results 
of  the  fit.  CAM  is  again  closer  to  the  true  parameter  values,  but  the 
fits  dre  \/ery   close.  SSE  for  nonlinear  LS  is  .9057  and  for  CAM  is  .9175. 
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Table  5.15   OLS,  CAM  for  Example  9 


OLS 


CAM 


3 


s.e. 


2.13 
.0357 


.3048 
.4943 


2.2301 

.2200 

-.0805 

.0933 
.1026 
.0772 

1891  ite 

rations 

-  In  L  = 

176, 

,7 
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Table  5.16   CAM,  Nonlinear  LS  for  Example  10 


Nonlinear  LS 
§  s.e. 


CAM 


19.9889 

.0485 


.01939 
,004154 


20.000 

.00178 

.0528 

.16966 
.01086 
.01055 

12300  iterations 

-  In  L  = 

91 

.63 
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Figure  5.24  shows  how  close  the  two  fits  are.  CAM  should  be  close  to 
nonlinear  LS  in  this  case  since  the  error  is  normal.  The  estimated  error 
density  is  given  in  Figure  5.25. 
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CHAPTER  6 
SUmARY  AND  SUGGESTIONS  FOR  FUTURE  RESEARCH 


M-estimators  in  regression  offer  alternatives  to  least  squares  in 
their  ability  to  cope  with  outliers,  and  to  model  the  error  process  more 
closely.  Previous  work  has  centered  on  developing  specific  ^   functions 
to  outperform  least  squares  in  particular  situations.  Some  asymptotic 
results  for  M-estimators  were  presented  in  Chapter  2,  The  effects  of 
misspecification  of  ^   in  symmetric  error  density  problems  was  detailed  in 
Chapter  3  for  two  classes  of  symmetric  error  densities.  Methods  for 
choosing  ii   a  priori  were  presented  in  Chapter  4.  \lery   little  work  has  been 
done  with  minimax  estimators.  It  seems  likely  that  useful  ii   functions  can 
be  derived  by  specifying  a  reasonable  set  of  constraints,  such  as  finite 
variance  and  finite  range,  on  the  class  of  error  distributions  and  finding 
the  minimax  i|;  for  that  class. 

Most  current  work  has  focused  on  attempts  to  find  one  ^  function,  or 
in  the  case  of  adaptive  M-estimation,  a  small  group  of  ijj  functions,  which 
will  adequately  model  all  potential  situations.  Many  criteria  for  measuring 
the  success  of  such  entries  have  been  proposed.  In  all  these  methods,  a 
practitioner  faced  with  a  set  of  data  to  analyze  must  choose  a  yp  function 
(or  a  small  set  of  ij;  functions),  before  data  analysis  can  begin.  CAM  has 
been  proposed  to  offer  an  alternative. 

CAM  models  both  the  regression  line  (the  expected  value  of  the  response 
given  the  independent  variables)  and  the  error  density.  Error  density 
estimates  from  CAM  have  been  demonstrated  to  approximate  the  shape  of  the 
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unknown  error  densities.  CAM  can  fit  a  wide  variety  of  alternative  i)   func- 
tions, continuously  indexed  by  the  parameters  of  the  GLF.  In  this  way  no 
one  'J;  function,  such  as  SINE  or  Bisquare,  or  finite  sets  of  ^   functions, 
such  as  in  Moberg  et  al.  (1980),  is  chosen  a  priori.  CAM  models  asymmetric 
error  densities  quite  well.  All  current  M-estimators  assume  symmetric  error 
densities.  In  addition,  CAM  consistently  produced  regression  parameter 
estimate  vectors  closer  to  the  true  values  than  did  least  squares.  Only 
when  the  data  were  normal,  or  could  not  be  modeled  within  GLF,  was  CAM 
further  from  true  values  than  least  squares. 

Much  work  remains  to  be  done  on  developing  and  testing  CAM  estimates. 
Algorithmic  improvements  are  needed  to  produce  estimates  faster.  More 
testing  situations  need  to  be  tried,  as  well  as  comparisons  to  other  M-esti- 
mators. Due  to  the  time  required  to  produce  estimates  (forty-five  minutes 
of  real  time  for  a  single  problem  was  not  uncommon),  a  simulation  study 
was  not  feasible  at  this  time.  Rates  of  convergence  to  theoretical  maximum 
likelihood  variances  are  unknown.  No  method  of  estimating  the  standard 
errors  of  the  estimates  currently  exists. 

Standard  least  squares  regression  results  need  to  be  extended  to 
M-estimators.  Selection  of  variables  in  M-estimation  has  been  studied  for 
LAV  (Gentle  and  Hanson,  1977)  and  for  minimum  sum  of  maximum  absolute 
weighted  errors  (Narula  and  Wellington,  1979).  Design  criteria  in  response 
surface  methodology  depend  directly  on  the  method  of  estimation.  Current 
formulations  for  minimum  bias  and  minimum  variance  designs  assume  least 
squares  estimators  will  be  used.  These  criteria  need  to  be  extended  to 
M-estimators.  Construction  of  designs  resistant  to  non-normality  needs  to 
be  extended  beyond  first  order  designs.   Improved  standard  error  estimates 
are  a  topic  for  future  research. 


APPENDIX  A 
COMPUTING  FACILITIES 


The  computations  in  this  study  were  performed  using  several  computers 
and  operating  systems. 

Preliminary  development  of  software  for  Price's  procedure  was  done 
on  a  Digital  Equipment  Corporation  PDP  11/34,  running  RSXll-M. 

Development  of  software  for  iteratively  reweighted  least  squares, 
continuously  adaptive  M-estimation,  generation  of  preliminary  test  data 
sets,  and  elementary  plotting  routines  was  done  using  the  facilities  of 
the  Northeast  Regional  Data  Center  (NERDC).  NERDC  operates  an  Amdahl  470 
V/6-II  and  an  IBM  3033-N  under  VM,  OS/MVS,  and  JES2/NJE.  Software  develop- 
ment was  performed  using  a  FORTRAN  Gl  compiler  available  through  the  McGill 
University  System  for  Interactive  Computing  (MUSIC). 

Construction  of  final  test  data  sets  and  CAM  fits  were  performed  using 
the  facilities  of  the  Center  for  Instructional  and  Research  Computing 
Activities  (CIRCA).  CIRCA  operates  two  Digital  Equipment  Corporation  VAX 
11/780  computers  running  VMS  Version  2.3. 

Non-linear  data  fitting  and  graphics  were  obtained  using  the  Statisti- 
cal Analysis  System  (SAS),  running  at  the  NERDC.  Graphics  were  drawn  on 
a  Gould  5100  electrostatic  plotter.  Input  for  the  graphics  was  prepared 
using  the  Conversational  Monitoring  System  (CMS)  running  at  the  NERDC. 

Least  squares  fits,  matrix  inversion,  random  number  generation,  and 
solution  of  systems  of  nonlinear  equations  were  performed  using  software 
from  the  Eighth  Edition  of  the  International  Mathematical  and  Statistical 
Libraries  (IMSL). 
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APPENDIX  B 
IRWLS  IN  SAS 


The  SAS  program  below  can  be  used  to  perform  IRWLS.   It  assumes  that 
a  SAS  data  set  has  been  created  with  a  dependent  variable  Y,  and  indepen- 
dent variables  XI,  X2.  The  ^p   function  used  here  is  for  the  power  family. 
The  power  to  be  used  is  specified  in  line  8.  Lines  8  and  9  could  be 
replaced  to  specify  v/eights  for  a  -p   function  other  than  that  for  the  power 
family. 

PROC  NLIN; 

PARMS  B0=0    B1=0     B2=0; 

FITTED  =  BO  +  B1*X1  +  B2*X2; 

MODEL   Y  =  FITTED; 

DER.B0=1; 

DER.Bl-Xl; 

DER.B2=X2; 

Q-1.5; 

_WEIGHT_=(ABS(Y-FITTED))**(Q-2); 

This  program  is  adapted  from  one  which  appeared  in  Volume  IV,  Number  4 
of  SAS  Communications. 
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APPENDIX  C 
AVF  IN  THE  POWER  FAMILY 


In  Chapter  3,  AVF  results  were  presented  for  the  power  family  of 
distributions  indexed  by  p,  c,  and  a!  The  following  plots  show  the 
value  of  AVF  for  combinations  of  p,  the  true  power,  and  q,  the  power 
used  to  perform  the  M-estimation.  Three  contour  plots  of  AVF  for  c=.5, 
.75,  and  1.0  are  presented.  Relative  AVF  is  presented  in  a  fourth 
contour  plot.  The  four  plots  are  then  presented  in  three-dimensional 
form. 


126 


a 
II 
o 


I 

I 

I 


■-  u> 


o 


o 


iij 


c_> 


s- 


cn 


127 


II 
u 


N 


r-   ^0 

I 
I 
I 


N   O 


o 

■+■ 

k 

<- 

o 

C-l 

in 

o 

1 

1 

II 

iO 

i 

o 

T~ 

'i 
u_ 

o 

N 

< 

s- 

s 
o 

LlJ 


!S 


128 


o 


\ 


o 

D 


O 


o 

TO 


1-^ 


o 
to 


o 


\ 

\ 

\ 

■     1    • 

1     '     1     ' 

a       o 

o 


\, 


o  o  o  o  o 

CT-  X3  r^  5i  "^^ 


o 
o 


o 


o 


<€■    O 

. 

^— 

O    "- 

II 

o 

o 

"•> 

>) 

E 

LL. 

o 

•5 

N                i 

i- 

O 

> 

Qu 

Li 

u. 

z 

> 

bJ 

< 

O 

U 

oo 

_I 

• 

o 

0) 

s- 

3 

D> 

■r™ 

u. 

(V( 


129 


N 


\ 


s 


\ 


K  \ 


I       \ 


\ 


\ 


I i T r 


N ^^-^ 


a       o 
o 


r 


C5> 


o 

■so 


to 


O 


Q- 


/ 


o 

M 


o 


u. 


S- 


o 


'^ 

QJ 

> 

t. 

•r— 

c 

+-> 

fC 

"— 

^^ 

LU 

0) 

'■:> 

q: 

u 

_j 

■=a- 

CJ 

QJ 

S- 

=J 

cn 

130 


o 
o 


o 
o 


<N 


mm 


-ih  iiflJi 


I  y 


\' 


'-]  , 


/ 


/,- 


> 


o 


hi  "y  /  r>< 

mm 


V  U! 


r-va  •• 


in 

II 
o 


n3 

s- 
<u 

o 


LO 


cu 


u. 


131 


o 


tM 


I   r  i 

UK 


miirf 


dHUij 


''il/if 


\ 


o 
o 


II 

o 

>> 


CO 
Uu 

s- 
cu 
s 
o 
a. 


vo 
o 

01) 

3 
CD 


132 


o 
o 


II 
o 


mm 

I      ■       i"J      K/  I  kit    .Vf 

(HkfiMiiii' 


si 

Wih 

/  L4i. 


II 
u 


o 


C7> 


133 


Q 


1- 


o 

Q. 


cu 

> 


to 


0) 


00 

&. 

C7) 


APPENDIX  D 
g  (p)  PLOTTED  FOR  COMBINATIONS  OF  A^  AND  X^ 

These  plots  show  g»(p)  plotted  on  the  vertical  axis  and  p  on  the 
horizontal  for  X^  =  (.4,  .13,  .01)  and  X^  =  (.4,  .13,  .01).  These  values 
represent  the  values  of  X-,  and  X.  arrived  at  during  the  ten  examples  in 
this  study.  All  the  plots  resemble  cubic  polynomials,  supporting  the  use 
of  the  seven  point  cubic  approximating  polynomials  developed  in  Chapter  5. 
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APPENDIX  E 
FORTRAN  PROGRAM  FOR  CAM  FITTING 


This  source  listing  ran  on  a  VAX  11/780  under  VMS,  VAX-ll  FORTRAN 
compiler.  The  main  program  prompts  for  a  data  file  name,  the  ordinal 
number  of  the  dependent  variable  in  the  data  file,  the  number  of  independent 
variables,  and  the  ordinal  numbers  of  each.  It  performs  a  multiple  regres- 
sion CAM  fit  by  adding  a  constant  and  calling  subroutine  FITAPC  to  perform 
the  calculations. 

FITAPC  assumes  that  a  file  called  FITBND.DAT  exists  which  contains 
upper  and  lower  bounds  on  the  parameter  estimates.  OPTPRI  is  called  to 
perform  Price's  procedure. 

OPTPRI  calls  FITAPX  to  evaluate  the  likelihood  using  the  seven  point 
cubic  approximation  Algorithm  5.12.  FITAPB  checks  new  points  to  verify 
they  are  within  the  bounds  specified  in  FITBND.DAT.  OPTPRI  uses  VMS  timing 
routines  to  report  elapsed  real  time  every   hundred  iterations. 

Other  software  used  in  this  study  is  contained  in  a  FORTRAN  subroutine 
library  available  from  the  author  (Conlon,  1981).  The  library  contains 
eighty  subroutines  used  in  generating  data,  calculating  important  functions 
(ijj,  p,  f,  F),  and  fitting  M-estimates. 
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IKPIICIT    HEAL*8     (/V-H,0-Z) 
PARAMETER  f 

*  IXY=500, 

*  IV=50, 

*  IB=11, 

*  I3P=IB+3, 

*  IP=(IB+1)  *IXY) 


DIMENSION  Y  (IXY)  ,X  (IXY,IB)  ,B(IBP)  ,P  (IP) 
DIMENSION  f]4)  ,IX  (IB)  ,V  (IB) 

CEARACTEE*80    FN 


TYPE    *, 'Enter    file    name' 

RIAD(6,8001)  L,FN 

OPEN  (DNIT=1 ,NAaE=FN,STATUS='old'  ) 


ACCEPT  ♦,!? 

TYPE  *, 'Number  of  independent  variables' 

ACCEPT    ♦,NX 

DC    1=1, NX 

TYPE    *, 'Variable' ,1 

ACCEPT    *,IX  (I) 
END    DO 
NE=0 

RFAD(1,*)     N7 

READ(1,*,END=10)      (V  (I)  ,  1=  1  ,  NV) 
NP=NP+1 
Y  (NP)  =V  (lY) 

X(NP,1)=1.D0  1     ADD    A    CONSTANT    TERM 

DC    1=1,  NX  ' 

X  (NP,I+1)=7  (IX  (I)  ) 
END    DO 
GCTO    5 
10  CLOSE  (aNIT=10) 


CALL    FITAPC  (Y,X,3,F,NP,NX+1,IXY) 

■ "•■■" "■" 11 

B(NX+2)  ,B  (NX4-3) 


PI1=(-1.D0/B(NX  +  2f )  /(1-D0  +  3(NX  +  3))  + 

*  (1.DO/B(NX  +  2))/]l-D0  +  B    NX  +  I)) 
WRITE  (6,80  0  0)      (B(I)  ,  1=  1  ,  NX+ 1)  ,  PLI  ,  B 

*  B(NX+4)  ,F(3)  ,?(M 


STOP 

8000  FCRMATC 

*  // 

8001  FCEMAT(Q,A<L>)' 
END 
SUBROUTINE    FITAPC  (Y,  X,  B,  F,  N,  K,  NX) 

C*A  + 

C'  FIT    AN    ADAPTIVE    SSTIJIATE    BASED    ON    GLF    USING    PRICE'S 

C  PROCEDURE.     THE    LIKELIHOOD    FU?!CTION    IS    CALCULATED    USING 

C  A    SEVEN    POINT    CUBIC    APPROXIMATION.     THE    LIKELIHOOD 

C  CALCULATIONS    ARE    SHORTCUTTED    IF    ANY    POINT    HAS    A    DSN- 

C  SITY    VALUE    OF    ZERO. 


C;A- 


* 


IMPLICIT    SEAL*8     (A-H,0-Z) 

LOGICAL    FITAPB 

EXTERNAL    FITAPX    FITAPB 

DIMENSION    Y(N)  .X(NX,K)  ,B(13),F(4)  ,PAR(3000)  ,WK  (7500), 

BL0(l3)  ,BHI(l3)  ' 

IXY  (IP)     =    2    +    IP 
IXX(IP,IX)     =    2    +    N    +     (IX   -    1)     *    N    +    IP 


mi 

13    1=1, N 


PAR 

PAR 

DC    13'  1  =  1, N 
L  =  IXY  (I) 
PARiL)=Y  fl) 
DO    20    J=1,K 


L=IXX  (I.J) 
PAR(L)  =X{I, 
20  CONTINUE 


PAR(L)  =X{I,  J) 
NTINl 
13  CONTINUE 
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30 


C;A  + 

C 

C 

C;A- 

C;R  + 

C 

C 

C 

C;R- 

C 


c 
c 
c 
c 
c 
c 
c 
c 
c 
c 
c 
c 
c 


c 
c 
c 


c 
c 
c 

6 


10 


K  1=K+3 

OFHNfUNIT^I  1,NAMB='fitbnd.dat'    S?ATDS='ol'i') 

DC    20    1=1, K1 

READflT,*)     BLO  (I)  ,BHI  (I) 

CONTINUE 

CLOSE {nNIT=11) 

ITMAX=1000000000 

TCL=.001D0 

CALL    OPTPRI  {FTTA?X,FITA?B,3,K1,BL0,SEiI,  V,TOL,  IT  tlAX, 

♦  PAR,WK,IER) 
F(1)=N 

F  |2    =K 

F  (3)  =ITMAX 

F  (4)=V 

RETURN 

END 

SOBROUTINE    OPTPRI (OPTFNC, OPTB ND, X, NX, XLO, XHI, F, TOL, 

*  ITMAX,PAR,WK,ISS) 

USE    PRICE'S    PROCEDURE    TO    FIND    A    GLOBAL    MINIHUM    OF    A 
FUNCTION    WITH    ARBITRARY    BOUNDARY    DEFINITION. 


PRICE,     W.     L.      n977)     "A    CONTROLLED    RANDOM    SEARCH    PRO- 
CEDURE   FOR    GLOBAL    OPTICUS  ATION" ,     THE    CO.^PUTER 
JOURNAL,     VOL    20,     NO    4,    ?P    367-370. 

IMPLICIT    REAL*8     fA-H,0-Z) 

EXTERNAL    0PTFNC,0PTBND 

LOGICAL    OPTBND 

REAL*U    TI M  E  0,  TI  ME  1,TIMEX,  TIMED 

DIMENSION  XL0{NX)  ,  XHI  ( NX)  ,  X  (NX)  ,HK(1)  ,PAR  (1) 

DEFINE  THREE  ADDRESSING  FUNCTIONS: 


IXF  (IP) 

JXF(IP,IX) 
KFX  (IP) 


RETURNS  THE  LOCATION  IN  HK  OF  THE  IP-TH 
FUNCTION  VALUE  3HEN  THERE  ARE  NX  DIMEN- 
SIONS IN  THE  PROBLEM. 

RETURNS  THE  LOCATION  IN  WK  OF  THE  IX-TH 
COORDINATE  OF  THE  IP-TH  POINT  STORED  IN  »K 

RETURNS  THE  LOCATION  IN  WK  OF  THE  IP-TH 
POINT  INDICATOR  (0  FOR  NOT  USED,  1  FOR  USED) 


IXF  (IP)  =(IP  +  2)  *  (NX  +  2)  -1 
JXF  IP, 1X1=  (IP+^)  *  (NXi-2)  ^-IX 
KXF  (IP)  =  (IP  +  2)  *  (NX<-2) 


TRY    TO    FILL    THE    TABLEAU    FROM    THE    RESTART    FILE 

I     BASE    TIME 


CALL    RNDCMN 

TIMEO=SECNDS  (TIMEX) 

11=0 

NP=25*NX 

NR=  {NP+2)  *(NX  +  2) 


15 


FILL    IT    FROM    SCRATCH 

IT=0 

DC    15    IP=1  ,NP 
DO     10    IX=1,NX 

aK(IX)=XLO  (IX)  +SYSUNI(0)*  (XHI  (IX) -XLO  (IX)  ) 

CONTINUE 

IFf.  NOT.OPTBND(WK,NX,XLO,XHI,PAR)  )     GOTO    5 

J=JXF{IP,1)_ 

CALL    UTLMV8  (SK,  1,NX,  1,HK,  J,  1) 

F=OPTFNC (HK,PAR) 

J  =  IXF  (IP) 

WK  (J)  =F 
CCNTINUE 
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20  J=IXF ( 1 ) 

CALL    I1TLMM8  (WK,  J,  NP,  NX  +  2,FH  ,IH  ,  FN,I11) 

C 

C      GIT  THE  HEM  POINT 
C 

NT=0 

i F (NT. GE. 5  0. AND. MOD  (NT, 50)  .EQ.O)  TYPE . * , • it er= ' ,IT, 
#  •  tries=',NT 

J=JXF(0,1)  ^      „ 

CALL    DTLFL8  (WK,J,NX,1,0.D0) 
tr  —  tr yp  /  1  ^ 

CALL    dTLFLO  (^v}K,K,NP,NX  +  2,0.D0) 
DC    30    1=1, NX 
22  J  =  1+NP*SYSUNI  (0) 

K =K XF  f J1 

IF(WK  {K)  .NE.O.DO)     GOTO    22 
WK(K)  =1,D0 
DC    24    ID=1,NX 
JX=JXF(0,ID) 


KX=JXF(J,ID 
HK(JX)=HK(JX)+WK(KX) 

CCNTINUE 


2U 

30  CCNTINUE 

40  J  =  H-NP*SYSONI  (0) 


K  ~K  IC^  (J  \ 

IF(wk  (K)  .NS.O.DO)     GOTO    40 

DC    50    ID=1 ,NX 


JX=JXF(0,ID) 
K T^  TXFiJ  TDi 
WK(ID)=2'D0*WK{JX)/DFLOAT(NX)-yK(KX) 

50  CCNTINUE 

IF(.n6t.0PTBND (WK,NX,XLO,XHI,PAR) )     GOTO    21 

FE=OPTFNC  (WK,PAS) 
IF(FP.GT,FM)     GOTO    21 

c 

C      REPLACE  THE  FDNCTION 
C 

WK(IH)=FP 
DC    60    IX=1,NX 
JX=IM-NX-1+IX 
HKf  JX)=WK  (IX) 
60  CCNTINnE 

IT=IT  +  1 

C  CHECK    FOR    ITERATION    DIVISIBLE    BY     100 

C 

IF(aOD(IT, lOOl-NS.O)     GOTO    19 
TIME1=SECNDS(TIHEX)  I     TIME    NOW 

TIMED=TI«E1-TIME0  ELAPSED    TIME 

SRITE  (6,8001)     TI«ED,IT,F?,FN,F«, 

8001    *FCR!SATi'^^  =  '^,^^8!)),'     i=  •  .  17  ,  3G1  5- 6,/,  <NX>F1  0  .  5) 

19  IF(IT.GT.ITMAX)     GOTO    7001 

C 

C  CHECK    FOR    BAD    START 

C 

IF(IT.EQ.  1.  AND.P?1,FQ.FN)     GOTO    6 

If1dABST(FM-FN)/(F2  +  FN)  )  .LS.TOL)     GOTO    7001 

GCTO    20 
7001       IER=0  ,  ,     , 

CALL    UTLMV8  (WK  ,  IN-NX,  NX,  1 ,  X  ,  1  ,  1 ) 

F=WK(IN) 

IiaAX=IT 

RETURN 

END 

SOB  ROUTINE    UTLMM8 ( A, IB, IE, IS, FM, IM,FN, IN) 

C*^*      GET    THE    aiN    AND    MAX    FROM    A    TWO    DIMENSIONAL    ARRAY 
C  *  A- 

IKPLICIT    REAL*8     (A-H,0-Z) 

DIMENSION    A(1) 
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Ff?=A(I3) 

FN=A(IB) 

I!'=I3 
IN=IB 
I=IB 
K  =  1 
10  I=I+TS 

K=K  +  1 

IF(K.GT.I2)     GOTO    20 
IF(A(r)  .LH.FM)     GOTO     12 

FM=A(I) 
IK=I 
12  IF(A(I)  .GE.  FN)     GOTO    10 

FN=A  (I 

IN=I 

GCTO    10 
20  RETOSN 

END 

SOB  ROUTINE  DTLMV8  {  A,  IB,  IE,  IS,  B,  JB,  JS) 
C  •  A  + 
C*     MOVE  ELEMENTS  FROS  ONE  ARRAY  TO  ANOTHER 


C;A- 


ikplicit  real*8   (a-h,0-z) 
-  "1)  ,bI-- 


DIMENSION    A  (1)  ,3(1) 
I=IB 
K  =  1 
J=JB 

B  (J)=A(I) 
10  I=I+IS 

J=J+JS 
K  =  K  +  1 
IF(K.GT.IS)     GOTO    20 

B  (J)=a7i) 

GCTO    10 
20  RETURN 

END 

DCDBLE    PRECISION    FUNCTION    FIT AFX  (BETA, PAR) 
C  •  A  + 

C'  CALCULATE    LIKELIHOOD    FUNCTION    IN    AN    ADAPTIVE    GLF 

C  PROBLEM    USING    VERY    FAST    ALGORITHM 

C  •  A- 

ICIPLICIT    RSAL*8     {A-H-L,0-Z1 

DIMENSION    XPX  (10)  .  BETA  ( 1)  ,  PAR  ( 1 )_,  PT  (7)  ,  B  (4)  -XPXS{10) 

DATA    PT/.01D0..  1d6 , . 2 5D0 , . 5  DO , . 75D0 , . 9D0, . 99D0/ 


DATA    XPXS/7.d0,3.5D0.2. 6752D0,2.6752D0,2.2628D0, 

*  1.99  96  08  52DO,2.26  2  8DO,1.99960852DO, 

♦  1.8110213D0, 1.666769806D0/ 
DATA    PMl/. 01D0/ 

C 

C  GET    RELEVANT    VALUES 


C 


FITAPX=O.DO 

N=PAR (1) +PM1 

K=PAE  (2    +PM1 

L2=BETA  (K  +  1 

L3=BETA  (K  +  2' 

L4=BETA  {K  +  31 

L1=  (-1.D0/L2)/{1.D0tL3)  +  ( 1  .  D0/L2)  /  ( 1 .  DO +LU) 

XEX     1)=XPXS  " 

XPX  (2)=XPXS 


XEX  (3)  =XPXS 
XPX  (4)=XPXS 
XPX  (5)  =XPXS 
XFX  ?6)=XPXS 
XPX  (7)=XPXS 
XFX  (8)  =XPXS 
XPX (91=XPXS 
XEX  (10)  =XPXS('10) 
3  (1)=0.D0 
3  (2    =0.D0 

3    3    =0-D0 
B  m)=0. DO 
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C 

c 
c 


10 


c 
c 
c 


c 

c 
c 

c 
c 
c 

30 

c 
c 
c 


32 


900 
999 


GET  CUBIC  PIT  AT  ZERO 

Y=PT{I)  ♦*L3-  (1.D0-?T(I)  )**L4  +L1*L2 

B  (1)=B(1)  +1 

B  (2  =B{2  ♦Y*PT(I) 

B  (3)=B{3  +T*PT(I)  *PT(I) 

B  (uUb]4)  ♦I+PT  (I)  *PT  (I)  *PT  (I) 

CALL^IMSLSQ(XPX,1,4,B,4,ID,D1,D2,IE) 


T=B(3)/B(4) 
a=B]2[/B  4 
LA=IJ-T*T/3. 


T/3.D0 

LB=io^'io'Sa{)  74  074074  07DO*T*T*T-0, 
T1=-L3/2.D0 
T2=LB*LB/4.  DO 
T3=LA*LA*LA/27.D0 


3 333333 3333333D0*T*n+V 


IF  IT  DOESN'T  SOLVE  AT  ZERO, 

GOTO  900 


RETURN  ZERO  LIKELIHOOD 


IF{T2+T3.LT.0.D0) 

BA=DABS^^T  +  T4)^U.33333333333333333D0 

I  F  (TH-T4.LT.  0.  DO)  BA=-BA 

BE^DABS  (T1-T4)  **i  3333333333  3333333D0 

IF(TI-Ta.LT.O.DO)     BB=-BB 

XE=BA+Ba 

P=XP-T/3.D0 


IF  THE  ROOT  IS  NOT  BETTs'EEN 
IF(P.LT.0.D0.OR.P.GT.  1-DO) 
CALCULATE    THE    REAL    POINTS 


ZERO 
GOTO 


AND 
900 


ONE,    RETURN    ZERO    L 


J=0 
J=J  +  1 

IF(J-GT. 

GET    JTH 


N)     GOTO    999 
RESIDUAL 


PEED=O.DO 

PRED=PRED  +  BETA(!1)  *PAR  (2  +  N+(M-1)  *N+J) 

CONTINUE 

RESID=PAR (J+2)-PRED 

LEi!^7l07407407?044^7l)0*T*T*T-0.33333333333333D0*T*U+V 

T1=-LB/2.D0 
T2=IB*LB/4. DO 
T3=LA*LA*LA/27.D0 
IF(T2+T3.LT.0.  DO)     GOTO    900 

BA=DABsTfl  +  T4)"^l*.  333333333  33333333D0 

IF  {T1 +T4.LT.0,  DO)     BA=-DA  ^^     „ 

BEiDABS{Tl-T4)**.33333333333333333D0 

IF(T1-T4.LT.0.D0)     3B=-BB 
XP=BA+BB 
P=XP— T/3    DO 

IFfP.LT.O.DO)     FP=0,000000001D0 
IF(P.GT.O.DO)     FP=0.000000001DO       ^ 
IF(P.EQ.O.DO)     FP=L2/(L4**  (L4-1-D0)  ) 
IF    P.GT.O.DO.AND.P.LT.  1.D0)  ^FP  =  L2/( 
♦  L3*P**  (L3-1,D0)  *L4*  (1.t)0-P)  **{L4-1.D0)) 

IF(P.EQ.I.DO)     FP=L2/  '  "  "    '    ""'  " 

FITAPX=FITAPX  +  DLC_  .--, 

TYPE    *-FP,?ITAPX,32SID 

GCTO    30 

FITAPX=-1-D30 

FITAPX=-FITAPX 


/(L3***(L3-1-D0)  ) 
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C      WBITE(6.17)  FITAPX 
17     FCRMAT(5X,G12.4) 

P.ZTUEN 

END 

LCGICAL  FUNCTION  FTTAPB  (BETA, N3ETA,BSTAL0, BETAHI, PAR) 
C  *  A  + 

C'     BCONDARY  FONCTIOM  FOR  PRICE'S  PROC3D0RE  WHEN  USED 
C      AS  AN  ADAPTIVE  ROBUST  REGRESSION  ALGORITHM 
C ;  A- 

C'     THIS  IS  A  SIMPLE  EUCLIDEAN  BOUNDS  BOUNDARY  FUNCTION. 
C      PRICE'S  PROCEDURE  AS  IMPLEMENTED  IN  THIS  LIBRARY  WILL 
C      ACCEPT  A  BOUNDARY  FUNCTION  OF  ARBITRARY  COMPLEXITY 
C 

IHPLICIT  REAL*B  (A-H,0-2) 

DIMENSION  BETA(l)  ,3ETAL0{1)  ,BETAHI  (1)  ,PAR  (1) 

FITAPB=. FALSE. 

DC  10  IBETA=1-NBETA 

IFfBETA (IBETA) .LT.BETALO (IBETA) )  GOTO  20 
IF]BETA (IBETA) . GT.BETAfll(IBETA)  I  GOTO  20 
10     CCNTINUE 

FITAPB=.TRUE. 
20     RETURN 

END 

SCBROUTINE  RNDCMN 
C  •  A  + 

C'     INITIALIZE  ALL  THE  VALUES  IN  THE  COMMON  AREAS  WHICH 
C      DIFINE  THE  GENERATION  OF  RANDOM  VARIABLES  AND  CON- 
C      VE3GENCE  PROCESSES. 
C  ♦  A- 

IMPLICIT  RSAL*8  (A-H,0-Z) 

CCMMON  /PDFCMN/  NPDF, IP  DP, PDFNAM , PDPLO, PDFHT, P 1  PDF, 

*  P2PDF,P3PDF,PaPDF 

CCMMON  /PSICMN/  NPSI, IPSI, PSINAH . PSILO, PSIHI, P 1 PSI, 

*  P2PST,P3PSI,P4PSI 
CCMMON  /LIMIT/   TOL,ITMAX 

C 

NPDF=1 

PCFLO=-20.D0 

PCFHI=20.D0 

IPDF=0 

P  1PDF  =  0.D0 

P2PDF=1.D0 

P3PDF=0.D0 

PUPDF=0.D0 
C 

NESI=1 

IPSI=0 

PSILO=-1,D+38 

P£IHI=+1.D+38 

P1PSI=0.D0 

P2PSI=1 .DO 

P3PSI=0.D0 

piJPSI  =  0.D0 
C 

TCL=1.0D-7 

ITMAX=1000 

RETURN 

END 


APPENDIX  F 
DATA  FOR  EXAMPLES 


For  each  of  the  ten  examples  of  Chapter  5,  the  data  used  are  listed 
on  the  pages  that  follow.  Each  data  value  is  listed  to  two  rounded  decimal 
places.  Actual  calculations  were  based  on  data  with  seventeen  significant 
figures  (FORTRAN  double  precision). 
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11.43 

2^.35 

2.87 

2  32  0. -7 

2.37 

12.07 

23.  32 

'4.41 

1503 

3.  =13 

13.17 

2  3.30 

4.43 

2  10  3.5 

3,02 

5.7  5 

4  1,89 

1.5  7 

13  9.13 

0.22 

1  2  .  ;3  B 

42.  19 

3.^3 

72,3.4  7 

4-56 

";.7  9 

3  1.72 

2,  85 

2  0R2, 9 

2.43 

n.rO 

39, 7U 

1 .  3  4 

6  5  2.86 

2,57 

11.90 

4  4.75 

0.67 

2R9.52 

6,51 

■4.03 

4c  .  f^U 

1.0- 

27  6,6  5 

3.0  8 

10.7  B 

4  7 .  r>  4 

1.14 

4  7  1.24 

2.80 

1  6  .  8 1 

2  U  ,  4  2 

9,93 

"?  a  a  f  c 

3.0  9 

3.5T 

4  ', .  3  1 

1,  1^ 

2S7,7  7 

2.19 

M  .  2  '4 

2  7  .  R  4 

2.  37 

1  •'  8  1 ,  3 

4,32 

12.6a 

2  5.05 

4,70 

2213. 3 

4,52 

12.'-. 

23-  31 

^.  3- 

2457.  1 

3,4fi 

iO.57 

2  5.52 

3,10 

3  7  0.85 

6,2  8 

3,01 

4  ?^ .  0  5 

0.87 

230.71 

1.4  3 

7.-'0 

4  7.32 

-).  5.- 

2  3  2.4a 

3,19 

1.27 

3^,01 

3,0  3 

19  0  0.1 

1.  12 

9.00 

4  1.31 

3,06 

33.94 

1.54 

n-3a 

3  1.1': 

4.19 

113  0.9 

2. '^9 

IU.2S 

2  4.52 

3.4^^ 

1  39  0 

3.54 

2  1.10 

27.01 

1.91 

12  57.  3 

3,2  1 

3.93 

4  1.74 

■) .  9  1 

2  0  7  .  h  3 

5,31 

10.35 

2  1.30 

3.  73 

2^49.4 

1,57 

15.  as 

32.54 

2.  47 

60  1.0  5 

S.12 

10-25 

2  '  .  9  5 

3.6  7 

22  3  1 

3.62 

lU.fiS 

24.71 

3.  25 

1740. 7 

7.6  6 

10.^7 

3  2.-^1 

3.  17 

1''.n7.  5 

1,7  6 

■'.30 

4  5,04 

1.21 

325.54 

2,4  8 

4 .  au 

43.56 

1.  30 

5  ^  n .  5  6 

3,M 

2,0  2 

4  1 .  1  H 

1.05 

22  0,  5  6 

1,03 

12.70 

4  4  .  1  9 

1.  2R 

4  0  0.0'' 

0.'^7 

12, 7S 

46.26 

1.  12 

15  2,01 

2.00 

12.49 

23.95 

2 .  "' ' 

-79.^  1 

7.4  0 

1  1.  114 

31.94 

T ,  7^ 

6  5  1.11 

2.  10 

13.10 

3  1.92 

1 .  ■  J  2 

2-0,95 

2-00 

11.77 

2  7 ,  7  U 

T    0  1 

76^.  7*5 

4.35 

'i  -  3  o 

21.44 

4,  54 

3  2  99. - 

3.0  1 

U.  13 

23.4? 

3,7  3 

26  3  1 

2,70 

-.13 

43.42 

1.0  3 

3  3  9.  5  5 

2.06 

2 ,  a  1 

4  5.12 

1.21 

2  4  '^  .  3  -^ 

1.13 

7.31 

•^  3  .  2  7 

4 .  4  - 

18  13.0 

2.01 

7.5  6 

2  9,  3  1 

3.43 

4  0  01.9 

2,45 

0  ,  2  2 

4  5.40 

0.90 

313.  39 

0,5  3 

13.56 

4  5^25 

0.56 

138.33 

5.14 

7.72 

41.  12 

1.73 

3  8  0.47 

10.23 

9.2a 

2B.  13 

2.72 

766. 54 

1.8  8 

3.80 

4  3.69 

2-07 

123.58 

16.71 

4.71 

47,20 

0.66 

24  3.6  9 

5,08 

Data   for  Example  1 
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a2.00      30.00       27,00       89.00 


37.00 

80.00 

27.00 

33.00 

37.00 

75.00 

25.00 

9  0,00 

28.00 

62.  00 

24.00 

37.00 

13.00 

62.00 

22.0  0 

37.00 

19.00 

62.00 

23.00 

37.00 

19.00 

5  2.00 

24.00 

9  3.00 

20,00 

62.00 

24.0  0 

93.00 

15,00 

38.00 

23.00 

37.00 

14.00 

53.00 

18.00 

80.00 

14.00 

53.00 

13.00 

89.00 

13.00 

58.00 

1^.00 

f^3.00 

11.00 

53.00 

13.00 

32.00 

12.00 

58.00 

19.00 

^3.00 

8.00 

50.00 

13.00 

39.00 

7.00 

50.00 

13.00 

86,00 

8.00 

50.00 

19.00 

72,00 

8-00 

5  0.00 

19.00 

7  0.00 

9.0  0 

50.00 

20.0  0 

8  0.00 

15,00 

56.00 

20.00 

32.0  0 

15.00 

70.0  0 

20.00 

91.00 

Data  for  Example  2 
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U.-;4 

1.00 

1.00 

n.ou 

1.00 

1.00 

n.  ii3 

1.00 

1.00 

a.  3  8 

1.00 

1.00 

5.75 

1.00 

1.00 

a.  4  3 

1.00 

1.00 

3.39 

1.00 

1.00 

5.84 

1.00 

1.00 

U.53 

1.00 

1.00 

3.5  3 

1.00 

1.00 

4.23 

1.00 

1.00 

5.91 

1.00 

1.00 

4.5  2 

1.00 

1.00 

3.31 

1-00 

1.00 

5.75 

1.00 

1.00 

3.-9 

1.00 

1.00 

3.31 

1.00 

1.00 

5.71 

1.00 

1.00 

6.25 

1.00 

1.00 

7.45 

1.00 

1.00 

6.93 

1.00 

0.00 

4.33 

1.00 

0.00 

5.35 

1.00 

0.00 

5.25 

1.00 

0.00 

5.32 

1.00 

0.00 

6.94 

1.00 

0.00 

4.75 

1.00 

0.00 

5.  13 

1.00 

0.00 

6.59 

1.00 

0.00 

5.70 

1.00 

0.0  0 

5.9  9 

1.00 

0.00 

5.57 

1.00 

0.00 

5.96 

1.00 

0.0  0 

5.91 

1.00 

0.00 

7.06 

1.00 

0.00 

7.12 

1.00 

0.00 

6.  55 

1.00 

0.00 

4.55 

1.00 

0.00 

5.6  1 

1.00 

0.00 

6.  14 

1.00 

0.00 

2.14 

0.00 

1.00 

4.20 

0.00 

1.00 

3.84 

0.00 

1.00 

1.6  0 

0.00 

1.00 

3.33 

0.0  0 

1.00 

3.27 

0.00 

1.00 

3.0  1 

0.  00 

1.00 

4.37 

0.00 

1.00 

3.51 

0.00 

1.00 

3,  17 

0.00 

1.00 

3.14 

0.00 

1.00 

1.77 

0.0  0 

1.00 

2.25 

0.00 

1.00 

3.41 

0.00 

1,00 

3.25 

0.00 

1.00 

0.82 

0.00 

1-00 

2.55 

0.00 

1-00 

1.91 

0.00 

1.00 

5.41 

0.00 

1.00 

1.65 

0.00 

1.00 

4.67 

0.00 

0.00 

3.54 

0.00 

0.00 

3.05 

0.00 

0.00 

5.33 

0.00 

0.00 

2.22 

0.0  0 

0.00 

5.57 

0.0  0 

0-00 

3.37 

0.00 

0.00 

2.69 

0.00 

0.00 

5.10 

0.0  0 

0.00 

4.52 

0.00 

0.00 

3.33 

0.00 

0.00 

3.64 

0.00 

0.00 

3.42 

0.00 

0.00 

3.61 

0.00 

0.00 

5.18 

0.00 

0.00 

3.  12 

0.00 

0.00 

5.35 

0.00 

0.00 

3-  51 

0.00 

0.00 

4.37 

0.0  0 

0.00 

4.58 

0.00 

0.00 

Data  for  Example  3 
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10.03 

1-00 

1.00 

2.75 

1.00 

1-00 

3.33 

1.00 

1.00 

7.4  5 

1.00 

1.00 

3.3  9 

1.00 

1.00 

8.36 

1.00 

1.00 

4.  34 

1.00 

1.00 

7.42 

1.00 

1.00 

7.3  0 

1.00 

1.00 

7.01 

1.00 

1.00 

3.8  8 

1.00 

1-00 

6.  14 

1.00 

1-00 

6.-9 

1.00 

1.00 

7.34 

1.00 

1.00 

6.57 

1.00 

1-OT 

12.36 

1.00 

1-  0  0 

10.33 

1.00 

1-00 

4.61 

1.00 

1-00 

12.08 

1.00 

1-00 

6.  10 

1.00 

1-00 

9.45 

1-00 

1-00 

14.39 

1.00 

1.00 

13.10 

1-00 

1.00 

10.40 

1.00 

1.0  0 

7.49 

1.00 

1.00 

8.^2 

1.00 

1-00 

4.67 

1.00 

1-00 

3.23 

1.00 

1-00 

7.16 

1.00 

1.00 

3.48 

1.00 

1-00 

7.85 

1.00 

0-00 

3.85 

1.00 

0.00 

8.05 

1.00 

0.00 

10.  18 

1.00 

0.00 

5.3  8 

1.00 

0.00 

2.47 

1.00 

0.00 

11.58 

1.00 

0.00 

6.80 

1.00 

0.00 

7.17 

1.00 

0.00 

3.05 

1.00 

0.00 

4.ti3 

1.00 

0.00 

4.79 

1.00 

0.00 

12.05 

1.00 

0.00 

6.55 

1-00 

0.00 

6.92 

1-00 

0.00 

1.53 

1.00 

0.00 

1.6a 

1.00 

0.00 

7.53 

1.00 

0.00 

3.00 

1-00 

0.00 

7.23 

1.00 

0-00 

11-55 

1.00 

0-00 

2.60 

1.0  0 

0.00 

6.94 

1.0  0 

0.00 

13.17 

1-00 

0.00 

5.72 

1.00 

0.00 

9.14 

1.0  0 

0.00 

9-60 

1.0  0 

0.00 

7-26 

1.00 

0-00 

2-07 

1.00 

0.00 

9.47 

1.0  0 

0.00 

6-  17 

0.0  0 

1.00 

9.83 

0.0  0 

1.00 

5.01 

0.0  0 

1.00 

10.71 

0.0  0 

1.00 

7.82 

0.00 

1.00 

11.69 

0.0  0 

1.00 

10.69 

0.00 

1.00 

5-89 

0.00 

1.00 

11.36 

0.00 

1-00 

10-"'4 

0.0  0 

1-00 

2-76 

0.0  0 

1.00 

0.79 

0.00 

1.00 

3-58 

0.0  0 

1.00 

2-^3 

0.00 

1.00 

0-71 

0.00 

1.00 

8.98 

0.00 

1.00 

4.00 

0-00 

1.00 

4.56 

0.00 

1.00 

5.83 

0.00 

1.00 

9-73 

0.00 

1.00 

1.  13 

0.0  0 

1,00 

3.42 

0.00 

1.00 

3.73 

0,0  0 

1.00 

3-93 

0.0  0 

1.00 

7.26 

0.0  0 

1.00 

5.21 

0.00 

1.00 

4.28 

0.0  0 

1.00 

3.02 

0.00 

1-00 

1-45 

0.0  0 

1-00 

5.52 

0,00 

1.00 

4-74 

0.00 

0.00 

9.41 

0.0  0 

0-00 

5.46 

0 .  0  0 

0,00 

6.90 

0.00 

0.00 

3.60 

0.00 

0.00 

9-06 

0.00 

0.00 

5-83 

0.00 

0.00 

6.65 

0.00 

0.00 

1.3fi 

0-00 

0-00 

2.49 

0.00 

0.00 

Data  for  Example  4 
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6.93 

0.00 

0.00 

6.  ia 

0.00 

0.00 

2.88 

0.00 

0.00 

5.59 

0.00 

0.00 

5.0  9 

0.00 

0.00 

6,76 

0.00 

0.00 

7.00 

0.00 

0.00 

a. 30 

0.00 

0.00 

6.24 

0.00 

0.00 

3.'i2 

O.OO 

0.00 

o.ou 

0.  00 

0.00 

6.^2 

0.00 

0.00 

9.59 

0.00 

0.00 

-0.10 

0.00 

0.00 

9.31 

0.00 

0.00 

3.5  0 

0.00 

0-00 

-0.20 

0.00 

0.00 

6.13 

0.00 

0.00 

1.30 

0.00 

0.00 

11.10 

0.00 

0.00 

Data  for  Example  4  (continued) 
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0.60 

-1.00 

-1.00 

0.1  = 

-1.00 

0.00 

0-51 

-1.00 

1.00 
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